Don’t Web Scrape Like a Robot

How to avoid being blocked

Darius Fuller
5 min read · Dec 4, 2020

In this post I will go over one of the many ways to keep yourself from losing access to a specific website while web scraping. Specifically, I will detail the creation of a custom Python function I used to scrape a popular song lyric website for a Natural Language Processing project.

Since this post is more about implementation than theory, this walkthrough may not work for your exact situation. If you are looking for more ideas on block-avoiding options, I would recommend checking out this great blog post by Manthan Koolwal.

Getting Started

To create this function, there are a few packages that you will need to import in order for it to work as intended (a sample import block follows the list):

  • Time — This package is key to how the function will work
  • Requests — Handles the connection to each webpage
  • BeautifulSoup (bs4) — Allows for programmatic exploration of the HTML returned by each request
  • Numpy — Handles the generation of the random numbers used within the function
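
For reference, a minimal set of imports might look like this (assuming requests, bs4, and numpy are already installed):

```python
import time                    # pauses execution between requests
import requests                # handles the HTTP connection to each page
from bs4 import BeautifulSoup  # parses the returned HTML
import numpy as np             # generates the random sleep times
```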

Once these are installed/imported, we can begin building this function out! But first, I will explain the “why” behind this function, which will help make sense of the structure it will ultimately take.

Superhuman Scraping

Prior to this NLP project, my web scraping experience was limited to sites such as Wikipedia (which has pretty lax rules) or using an API with a personal key, which often clearly communicates how many times one can connect to the database/website before penalties occur.

So, I was quite surprised when my large-scale scraping function failed after I had prototyped the same scraping code with smaller batches of connection requests. Then terror struck as I attempted to troubleshoot: trying to connect to the website in my Chrome browser greeted me with a verification warning page.

I later found out that this was only a warning; further scraping attempts left me with a completely blank page, without even the option to verify that “I’m not a robot.” Essentially, my IP address was blocked from the site, and nothing connected to my Wi-Fi would work.

After doing some research on the topic, I found that my IP address had been flagged due to the superhuman number of connection requests my computer was making while collecting lyrics. My code, when run, was executing as fast as my hardware would allow, far beyond the speed a human could achieve by repeatedly refreshing the page. Making connection requests at such a pace can be indicative of a denial-of-service (DoS) attack. Wikipedia defines this as:

…a cyber-attack in which the perpetrator seeks to make a machine or network resource unavailable…[by] disrupting services of a host connected to the Internet. [DoS] is typically accomplished by flooding the targeted machine or resource with superfluous requests in an attempt to overload systems and prevent some or all legitimate requests from being fulfilled.

In most cases, one would not want to slow down their code, but there is always an exception to a rule.

Slowing Things to a Crawl

My plan was simple: augment my original scraping function to work on a batch-basis with a randomized delay, hopefully bypassing any quantity or time-related security protocols by scraping in a non-consistent manner.

The Plan

In words, here’s what I needed to do:

  1. Use Numpy to create a list of non-integer numbers from which I would randomly select a period of time (in seconds) to make the code “sleep”.
  2. Set a variable to keep count of how many songs have been scraped; this will serve as a limiter
  3. Set a variable to track the number of songs skipped over, since I will be scraping in batches
  4. Check that my limit has not been reached before scraping
  5. Set the URL to be scraped into a variable, or skip it if the song has already been scraped
  6. Connect to the webpage, pull the necessary info via BeautifulSoup, and store it in a variable
  7. Track this as a successful scrape attempt
  8. Put code execution to sleep for a randomly selected amount of time
  9. Continue until the limit is reached

In my case, prior to creating this function, I scraped all of the links to the webpages for the song lyrics and stored them in a dictionary. Each key was the song’s title and the corresponding value was the ending portion of the URL to the lyrics. I would suggest taking a similar approach if possible.
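
For illustration only (the titles and URL endings below are made up), the dictionary looked roughly like this:

```python
# Hypothetical link dictionary: song title -> ending portion of the lyrics URL
song_links = {
    "Song Title One": "/artist/song-title-one-lyrics",
    "Song Title Two": "/artist/song-title-two-lyrics",
}
```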

Coding It Out

Steps 1–3

The numbers I chose as the lower/upper limits of the list were arbitrary. My goal was to get a list that would make the function wait anywhere from a few seconds to a couple of minutes between requests, without repeating values in a way that might trigger another block.
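
A minimal sketch of steps 1–3; the bounds and length of the list here are arbitrary examples, not the exact values I used:

```python
# Step 1: pool of possible (non-integer) sleep times, in seconds,
# spanning a few seconds to a couple of minutes.
sleep_times = np.linspace(5.0, 120.0, num=500)

# Step 2: counter that limits how many songs get scraped in this batch.
scraped_count = 0

# Step 3: counter for songs skipped because an earlier batch already scraped them.
skipped_count = 0
```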

Steps 4 & 5

The intuition behind the “try/except” statement stems from how each scraped song is stored back into the dictionary as a BeautifulSoup tag. This allows the ending URLs, stored as strings, to be targeted based on their data type. Step 9 is implemented in the “for” loop and the “if” statement that enforces the limit set as a hyperparameter.
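
Here is a sketch of how steps 4 and 5 (plus the loop that enforces step 9) might fit together; song_links, batch_limit, and base_url are illustrative names, and the domain is a placeholder:

```python
batch_limit = 100                                   # hyperparameter: songs per batch
base_url = "https://www.example-lyrics-site.com"    # placeholder domain

for title, value in song_links.items():
    # Step 4 (and step 9): stop once the batch limit is reached.
    if scraped_count >= batch_limit:
        break
    try:
        # Step 5: a string value is an un-scraped URL ending.
        url = base_url + value
    except TypeError:
        # The value is already a BeautifulSoup tag from a previous batch.
        skipped_count += 1
        continue
```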

Step 6

The if statement here was implemented as a failsafe in the event I got blocked mid-batch, allowing the code to finish gracefully and preserve whatever songs were scraped in that attempt.
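
Continuing inside the same loop, step 6 might look like the following; the CSS selector is a placeholder, not the site’s actual markup:

```python
    # Step 6: request the page and parse it with BeautifulSoup.
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        # Placeholder selector; swap in whatever element holds the lyrics.
        song_links[title] = soup.find("div", class_="lyrics")
    else:
        # Failsafe: a non-200 response (e.g. a block) ends the batch early,
        # preserving everything scraped so far.
        print(f"Stopped at '{title}' (status {response.status_code}).")
        break
```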

Steps 7 & 8

After some thought, to help avoid repeating sleep times, I added a randomly selected rounding factor that gets applied to the sleep time before it executes.

For example: if the alarm value was 55.5472 on two consecutive iterations, the rounding factor could be 3 then 2, resulting in the executed sleep times being different (55.547 and 55.55, respectively).
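
Still inside the loop, steps 7 and 8 could be sketched like this (the choice of possible rounding factors is arbitrary):

```python
    # Step 7: record the successful scrape.
    scraped_count += 1

    # Step 8: sleep for a randomly chosen time, rounded to a randomly chosen
    # number of decimal places so back-to-back draws rarely repeat exactly.
    rounding_factor = int(np.random.choice([1, 2, 3, 4]))
    alarm = round(float(np.random.choice(sleep_times)), rounding_factor)
    time.sleep(alarm)
```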

The Final Function

My finalized code includes some lines meant to give the user feedback during execution, since the sleep times varied and I wanted to be sure everything was still working in the background. These lines are purely decorative and do not need to be in your own version. In fact, I recommend using my function as scaffolding to be personalized to each project’s or user’s needs.

Here it is:
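
A self-contained sketch that assembles the pieces above follows; the domain, the CSS selector, and the default batch size are placeholders rather than the exact values from my project:

```python
import time
import requests
from bs4 import BeautifulSoup
import numpy as np


def batch_scrape(song_links, batch_limit=100,
                 base_url="https://www.example-lyrics-site.com"):
    """Scrape up to batch_limit un-scraped songs from song_links, sleeping a
    random amount of time between requests. Lyrics are stored back into the
    dictionary as BeautifulSoup tags."""
    sleep_times = np.linspace(5.0, 120.0, num=500)     # Step 1
    scraped_count = 0                                  # Step 2
    skipped_count = 0                                  # Step 3

    for title, value in song_links.items():
        if scraped_count >= batch_limit:               # Steps 4 & 9
            break
        try:
            url = base_url + value                     # Step 5: still a string
        except TypeError:
            skipped_count += 1                         # already a scraped tag
            continue

        response = requests.get(url)                   # Step 6
        if response.status_code != 200:
            # Failsafe: keep whatever was scraped before the block/error.
            print(f"Stopped at '{title}' (status {response.status_code}).")
            break
        soup = BeautifulSoup(response.text, "html.parser")
        song_links[title] = soup.find("div", class_="lyrics")  # placeholder selector

        scraped_count += 1                             # Step 7
        alarm = round(float(np.random.choice(sleep_times)),    # Step 8
                      int(np.random.choice([1, 2, 3, 4])))
        print(f"Scraped {scraped_count}/{batch_limit}; sleeping {alarm}s...")
        time.sleep(alarm)

    print(f"Batch finished: {scraped_count} scraped, {skipped_count} skipped.")
    return song_links
```

Calling batch_scrape(song_links) once per batch gradually fills in the dictionary, with already-scraped songs skipped on each subsequent pass.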

Conclusion

Simply put, the Internet is a place full of rules but lacks a comprehensive tutorial. Hopefully, this function/strategy will save you some time and hassle at the least. Best-case scenario, it will save your IP address from being banned from a data source crucial to a project’s completion.

Follow me on Twitter

Connect with me on LinkedIn
