Fili-busted!

Darius Fuller
15 min read · Sep 23, 2020

Part 4: Predicting the vote

From: Netclipart

This post covers the last component of my supervised machine learning project, in which I used web scraping to create a data set on the candidates in United States Senate general elections from 1920 through 2016.

In part 3 of this series, I went over what kind of information I was able to glean from my data set using traditional Exploratory Data Analysis (EDA) techniques with the Pandas and Matplotlib Pyplot packages. I was able to confirm some of my assumptions about how one win can set the foundation for successful campaigns in the future and how membership in particular parties can affect performance at the polls.

Continuing with the project's workflow, I will now discuss how I re-purposed the data for use in a machine learning model to predict how many votes a candidate would receive.

Last Time…

In the previous post, I went into a fair bit of detail about the features in the final DataFrame I used for EDA purposes. For clarity, I will only explain the eight features that I used for the regression (down from 11) in this post, but feel free to jump back if you want more information.

A few of the things I observed during my EDA:

  • The most frequent first name for a U.S. senator is John (surprise?)
  • Participation of third-party candidates fluctuates over time, with some parties vanishing completely
  • Before incumbency, a candidate’s party matters the most

The Big Question

I will be taking these ideas forward in conjunction with my regression results to see if any further insights can be found while simultaneously addressing my initial question:

Can I predict the winner of a given U.S. Senate election using a data set scraped from Wikipedia?

Prep Time!

In order to answer this question, I needed to modify my data set for use in a regression model. There are a lot of great resources for data cleaning best practices (such as this post by Omar Elgabry) but I will share some of the things I needed to do for this project:

  • Fill missing values (handled during data assembly)
  • Group party membership before encoding (handled during data assembly)
  • Re-interpret “seats up” and “seat before” variables as percentages (handled during data assembly)
  • Create one-hot encoded categorical variables
  • Numerically encode election year

Filling missing values

In a very broad sense, there are two strategies for addressing data points with missing values: fill them or drop them. I opted primarily to fill missing values during the cleaning process for my model, because any non-essential data points had already been removed during my initial assembly of the data set. Just like snowflakes, every data set is different, so there is no single strategy that will efficiently clean them all. Make sure to spend time exploring your data and its underlying topic, as background knowledge is often a key factor in determining the right cleaning strategy for a data set!

One example of such a decision is how I chose to fill in the missing turnout values. Due to some variance in the source HTML, my scraping functions left missing values in some of the elections. In total, there were 17 years I needed to address, each with at least one Senate election where one or more candidates were missing a turnout number. My options for handling them were to:

  • Take the closest prior and subsequent cycles' numbers and use some sort of midpoint or average
  • Enter a numerical placeholder such as 0, expressing the lack of data in a way that can still be used in a regression
  • Use the ol' reliable Google search for each election's numbers and enter them individually

Against my better judgement, I took the option with the highest time and code requirement: individually filling in each value according to personal research. I did this to ensure the data I had was as accurate as possible. In my prior experience, I had mostly trained regression models on data sets with 10+ features, so I felt I needed to protect the integrity of my eight features any way I could.

I was able to take advantage of a great method for filling missing values available on Pandas DataFrames, (creatively) called .fillna(). This one method significantly shortened the time spent coding once I had retrieved the correct information from the Internet.

In words, my strategy was:

  1. Find and collect the indices of the elections in each year that have a missing turnout value
  2. Navigate to the corresponding Wikipedia page to obtain the reported number (Google search if not there)
  3. Arrange these values into a nested dictionary: the outer dictionary has the years as keys and dictionaries as values; each inner dictionary has the indices of the targeted rows as keys and their respective turnout number(s) as values
  4. Pass each year's inner dictionary as the value= parameter of .fillna() to assign the correct value to the right index
  5. Forward fill ("ffill") any remaining missing values in a second pass; this fills in the value from the closest row before the target row (according to the index)
  6. Convert the turnout values from strings to integers and store the results in a dictionary for further cleaning

Here is how I implemented my strategy in the code:

Highly condensed version of code from a custom function I used in the project
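In rough, runnable form, the idea looks something like this; the years, indices, and turnout numbers in year_dict below are placeholders rather than the real researched values, and tables is assumed to be a dictionary mapping each election year to its scraped DataFrame:

```python
# Hypothetical years, row indices, and turnout numbers purely for illustration.
year_dict = {
    1928: {3: "1,024,342", 17: "876,511"},
    1942: {5: "654,089"},
}

# 'tables' is assumed to map each election year to its scraped DataFrame.
cleaned = {}
for year, table in tables.items():
    turnout = table["Turnout"]
    if year in year_dict:
        # A dict passed to Series.fillna() maps row index -> fill value
        turnout = turnout.fillna(value=year_dict[year])
    # Forward fill any remaining gaps from the closest earlier row
    # (equivalent to .fillna(method="ffill"))
    turnout = turnout.ffill()
    # Strip the comma formatting and convert the strings to integers
    table["Turnout"] = (
        turnout.astype(str).str.replace(",", "", regex=False).astype(int)
    )
    cleaned[year] = table
```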

The original code included more years in the year_dict than displayed above in addition to other operations being performed on each of the tables. However, the snippet above demonstrates my strategy for handling missing values in the “Turnout” column.

Getting the party started

My plan for prepping the "Party" column was very straightforward despite the variety of values included in this data set. Originally, it contained about 190 unique parties to which candidates have pledged their allegiance over the years. If I one-hot encoded these, I would be adding roughly 190 features for my regression to consider. This, in addition to potentially needing to one-hot encode the "State" column, would quickly lead to a sparse data set consisting mostly of zeroes.

As defined on page 6 of the Defining Data Objects developer pages from Oracle, sparse data is:

“A variable with sparse data is one in which a relatively high percentage of the variable’s cells do not contain actual data. Such ‘empty’, or NA, values take up storage space in the file”

This storage space is precious: the more of it you preserve, the more easily your machine can process the data set. The same Oracle developer page recommends actively managing your data's sparsity by "keeping analytic workspace size to a minimum", which in turn promotes good performance.

Keeping this in mind, I transformed the 190 or so parties into an encoded representation consisting of five different values. I decided how to group them based on their all-time representation, allowing those with greater than a 5% participation rate to maintain their sovereignty; this equated to a requirement of roughly 275 candidates to make the cut.

Here’s how I coded this in:

Snippet from ‘st_mapped_cleaner’ custom function used to encode political party membership
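As a rough sketch of the grouping logic (the actual function did more than this, and any party labels here beyond Democratic, Republican, and Socialist are placeholders rather than the real groupings):

```python
# Parties above the ~5% participation threshold keep their own code;
# everything else falls into a single catch-all bucket.
kept_parties = {
    "Democratic": "D",
    "Republican": "R",
    "Socialist": "S",
    "Independent": "I",  # placeholder for another party above the threshold
}

def encode_party(party):
    return kept_parties.get(party, "O")

df["Party_enc"] = df["Party"].map(encode_party)
```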

Messing with the seating chart

Some of the tables scraped from Wikipedia included values that indicated one of two things about a given political party's presence in the U.S. Senate: the raw count of seats the party held at the beginning of that year's election cycle, and the raw count of its seats up for election in that cycle (most states do not elect both of their seats at the same time).

While these raw counts could have been used in a regression model as-is, I wanted to transform them into a standardized form that could be easily understood at a glance. Standardizing these values makes for easier comparison across time, because the U.S. Senate, along with the total number of available seats, has grown since 1920. I ended up with these two features:

  • Seats_before% → The share of seats held by a party at the end of the last election cycle, computed as the number of seats held as of the last cycle divided by the total number of Senate seats available at that time
  • Seats_up% → The share of a party's seats that are up for grabs in the current election year, computed as the number of seats up divided by the total seats held at the beginning of the elections

Here’s how I did it:
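In sketch form, assuming columns named Seats_before, Seats_up, and Total_seats for the chamber size in that cycle (the actual column names may differ):

```python
# 'Seats_before', 'Seats_up', and 'Total_seats' are assumed column names.
df["Seats_before%"] = df["Seats_before"] / df["Total_seats"]
df["Seats_up%"] = df["Seats_up"] / df["Seats_before"]
```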

Feeling hot, hot, hot

My next alteration was to one-hot encode my relevant categorical variables so that they could be represented numerically and thus be useful in a regression model. This is easily achievable with Pandas' get_dummies() function, which, when given a DataFrame, will convert all non-numerical columns into "one-hotted" representations. It is a bit easier to see visually, so I'll share the before and after for the columns in my modeling DataFrame.

Before:

.info() display for X data prior to one-hot encoding

After:

.info() display of X data post one-hot encoding

This time-saving function took the two categorical columns I had and spread the data accordingly across brand new columns. These new columns in essence serve as a binary representation of the original data's form. For example, with regard to expressing a candidate's party, every declared Socialist would have a 1 in the "Party_enc_S" column and a 0 in each of the other "Party_enc_" columns. A key detail here is the lack of a column representing Democratic party membership. This is due to a parameter available when using get_dummies() called drop_first=. Using this parameter helps reduce redundancy in the data by causing what I like to refer to as an assumption.

Essentially, if you know the total number of possibilities in a given category (let’s say 8), you only need to keep as many columns as that number minus one (8–1=7). This means that for one row, if all one-hotted columns in a given category are set to 0, the model is assuming a 1 value is present in the remaining possibility not included in the columns. With respect to my project and the “Party_enc_” columns, my model would assume Democratic party membership.
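A minimal sketch of the call, assuming X holds the cleaned feature columns:

```python
import pandas as pd

# Convert every non-numerical column into one-hot columns, dropping the first
# level of each so the model can "assume" the omitted category.
X_ohe = pd.get_dummies(X, drop_first=True)
```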

Forward thinking

The last change I made to the data set prior to modeling was re-interpreting the year value as a number that could be useful in a regression model. My issue was not with the formatting, since each year is already represented as a number, but with the concept of a year as a number. Simply put, my model would not be able to interpret the number 2028 in a way that would let it use what it learned from the 2008 elections.

To give the model a sense of relativity to the present, I used the Pandas Series .map() method in conjunction with a lambda function to establish an origin point from which all year values would be measured. Using 2020 was an easy choice, as it is the year this project was made; it also makes an even -100 represent the oldest information I had, from the 1920 elections.

Here is how I did it:
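A one-line sketch of the mapping, assuming the column is simply named "Year":

```python
# 2020 becomes the origin (0), so the 1920 elections map to -100.
df["Year"] = df["Year"].map(lambda year: year - 2020)
```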

Training the Model

Now that I’ve explained some of my choices leading up to the modeling process, let’s explore how well Sci-Kit Learn’s random forest regressor was able to perform using the data set I created!

Train-Test Splitting the vote

When preparing to train a model using supervised learning, one needs to know how to evaluate and interpret that model’s performance with respect to the initial task (the big question).

One common way to do this with relative ease is through the train-test split function found in Sci-Kit Learn's "model selection" module. Keeping this post's length in mind, I'll point to this blog post by Jason Brownlee for more information on what a train-test split is, when to use it, and some useful features the function provides.

In my case, I used a very simple implementation, keeping with the default parameters and setting a random state for replication’s sake.

Creation of one-hotted X data for context; train-test split on X_ohe
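A sketch of that call, assuming X_ohe is the one-hot encoded feature set and y is the target (each candidate's share of the turnout); the random state shown is illustrative:

```python
from sklearn.model_selection import train_test_split

# Default split proportions, with a fixed random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X_ohe, y, random_state=42)
```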

Now I have a mini data set that I can cross-reference my model's predictions against after the training is complete!

Picking the model

In my opinion, one of the most fun parts about machine learning is how many different ways a single task can be approached and completed. When choosing an actual supervised learning algorithm for a task, this variety continues. In my project, I created a custom function that helped to streamline a potentially time-intensive process. This function would:

  • Take in train-test split data (shown above)
  • Train one or multiple instances of a model (regressors in my case)
  • Make predictions using training and testing data sets
  • Generate and store evaluation metrics for each model in a DataFrame
  • Optionally display the metrics as they are generated

If you’re interested in seeing more about this function, check out this link to the actual code I used!
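The linked code is the real version; as a simplified sketch, such a helper might look something like this:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_regressors(models, X_train, X_test, y_train, y_test, display=True):
    """Fit each model, predict on train and test, and collect R^2, MSE, and RMSE."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        for split, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
            preds = model.predict(X)
            mse = mean_squared_error(y, preds)
            rows.append({"model": name, "split": split,
                         "r2": r2_score(y, preds),
                         "mse": mse, "rmse": np.sqrt(mse)})
    results = pd.DataFrame(rows)
    if display:
        print(results)
    return results
```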

Keeping in mind my deadline, I decided to test six different algorithms using this function before committing to further parameter tuning. Here are my results (names were added for clarity):

Evaluation table from custom function

The table generated by my function displays each model's performance on both the training and test data sets. Using the coefficient of determination (R-squared), the mean squared error (MSE), and its root variant (RMSE), I was able to get a general sense of how well each model performed during its trial run.

However, I ultimately based my decision on which model produced the smallest RMSE value, which happened to be the random forest regressor from Sci-Kit Learn (Index 1 in the image above).

Tuning it up

My next step after choosing a model was to see how much better I could make its predictions beyond the trial run. This process is better known as "hyperparameter optimization", or simply tuning. Wikipedia defines it as:

“…the problem of choosing a set of optimal hyperparameters for a learning algorithm.”

Additionally, they expand upon what a “hyperparameter” is with reference to machine learning:

A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

If you're starting to notice a pattern here, there is a wide variety of hyperparameter combinations one could use in a single supervised learning model. This meant there was a nearly (if not literally) infinite number of combinations to account for when trying to improve on my trial run's results. Luckily, Sci-Kit Learn comes to the rescue again!

The GridSearchCV class is available from the same "model selection" module and makes the daunting task of hyperparameter tuning a bit less scary. This magical tool trains a single type of model with every combination of the hyperparameter values you make available to it. Once this is done, it can tell you which combination performed the best according to the metrics you tell it to track!

It is super easy to use too: once I had my random forest object, I just needed to create a dictionary of the parameters I wanted to test, and the function did the rest! Here's how I used it:

Example code setting up and fitting GridSearchCV
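A sketch of the setup, with an illustrative parameter grid rather than the exact values I searched:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the values I actually searched were different.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}

grid_s = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid_s.fit(X_train, y_train)
```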

After the .fit() is done, the grid search object (grid_s in my case) is now able to return the search results using various methods or attributes. Using the grid_s object from above, I was able to display my results by coding:
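For example, a fitted search exposes attributes such as best_params_, best_score_, and cv_results_:

```python
import pandas as pd

# The best-performing combination and its cross-validated score...
print(grid_s.best_params_)
print(grid_s.best_score_)

# ...or the full results table for every combination tried.
results = pd.DataFrame(grid_s.cv_results_)
```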

GridSearchCV results from code above

The above grid search did not produce the final hyperparameters I ended up using in my model, but it illustrates the process in a general sense. I executed a number of searches, taking the "best parameters" from prior searches into account when creating the dictionary for the next one.

Getting Results

At this point in a regression problem, the workload shifts from the brain and coding fingers over to the hardware buried inside the computer. To recap, I have:

  • Filled in missing values
  • One-hot encoded categorical variables
  • Created numerical variables
  • Train-test split the data
  • Compared multiple regression models
  • Conducted a grid search for optimal hyperparameters

The training begins…

Because I was using the same custom regressor testing function from earlier, I only needed to instantiate a regression object with my final hyperparameters and run it through the function with the necessary data. In addition, I generated a scatter plot of residuals for both the testing and training data sets as well as a feature importance plot to help visualize the prediction results for interpretation.
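In sketch form, using placeholder hyperparameters (the real values came out of the grid searches) and the evaluation helper sketched earlier:

```python
from sklearn.ensemble import RandomForestRegressor

# Placeholder hyperparameters purely for illustration.
final_rf = RandomForestRegressor(n_estimators=500, max_depth=20,
                                 min_samples_leaf=2, random_state=42)
final_results = evaluate_regressors({"Tuned random forest": final_rf},
                                    X_train, X_test, y_train, y_test)
```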

Here are my final results:

Final results table for tuned model

The final model was able to achieve (approximately) an RMSE of 10.86 and an R-squared of 82.92%, a 0.64% and 0.36% improvement over the trial model respectively. Although the grid search did not yield much of an improvement, being able to model a data set scraped from Wikipedia with this level of accuracy feels great for my first attempt at such a project.

The residuals:

Scatter plot visualizing residuals of testing data

The scatter plot of residuals helps to visualize some interesting patterns in the predictions made by my model.

For example, on the lower-left, there appears to be a group of data points that are mirrored over the blue "true value" line. This indicates a group of candidates the model seemingly could not understand and therefore made a guess of sorts. Similarly in the category of "misunderstood" candidates, there is a small group of points at the bottom-right of the graph that were predicted to receive a low percentage of the vote despite actually netting nearly 100% of their respective turnouts.

Secondly, at the top-right, my model simply under-predicted almost every candidate who actually received more than 70% of their respective turnout. For some reason unknown to me, the model is missing something in the underlying patterns of the input data that would allow it to place its predictions closer to the "true value" line.

Lastly, the scatter plot shows how well my model performed with the candidates who fell into the "middle of the pack". There is a clear, consistent grouping around the center (35%-65%) that demonstrates a solid understanding of what makes a candidate fall into this range.

Taking these observations into account: if the model performed as it does in that middle range along the entire plot, my results would improve greatly. At minimum, I can use these observations to help improve my data set for later work.

The feature importances:

Feature importance plot for final model (missing ‘Seats_before%’ = 0.77)

A huge benefit of using a Sci-Kit Learn random forest is access to the "feature importances", values that represent how much influence each feature has on the regressor's predictions. Simply put, the higher the value, the more weight a feature has in moving the needle; it does not, however, indicate which direction the needle moves.
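For reference, pulling those values out of a fitted Sci-Kit Learn forest looks something like this (final_rf standing in for the tuned model):

```python
import pandas as pd

# feature_importances_ lines up with the columns the model was trained on.
importances = pd.Series(final_rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh")
```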

Sci-Kit Learn’s website provides a warning for their importance plots:

“Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values).”

Looking at the feature importances, it is clear that relatively speaking:

  • Being the incumbent senator has more of an effect on a candidate’s results than their party allegiance
  • How many seats a candidate’s party has up for election is less informative than the year itself
  • How many votes are cast is more informative on a candidate’s performance than how many terms they’ve served

A brief disclaimer:

Although I am not sure at this time, the feature that received the highest importance, the "Seats_before%" variable, may have been affected by this misleading behavior. Its importance value was ~0.77, over 10 times higher than the next highest value of ~0.07.

Because of this, I filtered that value out of the graph above to allow the remaining features to be shown in better detail. Since the values do not indicate how much a given feature positively or negatively affects the prediction, I was more interested in understanding the relative importances than the values themselves.

Are We Done Yet?

In summary, the Fili-busted! series has documented, in various styles, how I took some pages from Wikipedia and used data science techniques to create a machine learning model that can be implemented for future use.

Also, as a side effect, I learned a great deal about the U.S. Senate, its past elections, and its former members. Voting is one of the most powerful tools the people of any nation have to make a change in their government. Besides displaying a project I am proud of, I intend this work to be part of a larger project: one that can help those who want to make a change, do so.

But enough about me, if you’ve made it this far, I really appreciate your support and hope to continue to share more of my projects over time.

Links

Fili-busted!

  • Part 1 — A web scraping introduction
  • Part 2 — Web scraping multiple pages
  • Part 3 — Exploring the data

Link to full project
