How to protect your website from bots

On the web, bots are everywhere. According to various sources, roughly two-thirds of web traffic comes from bots, and about 40% of traffic comes from bad bots (scrapers, spam bots, and so on).

"Bot" is a very broad term. It covers web scraping, web crawling, automated trading, automated messaging, spamming, and more.

If you run a website, you can be sure you will have to deal with bad bots at some point.

Protecting your website against bots is a real challenge. Many tools, services, software libraries and frameworks exist that facilitate bot development, and there are even software developers who specialize in it.

Another challenge is that anti-bot measures can hurt the user experience. It is no secret that users dislike solving captchas, so every captcha you add to stop bots also adds friction for real visitors.

Lastly, there are good bots that crawl your website for indexing or monitoring. Google, Yahoo, web archives and other search engines need to crawl your website in order to index it. By fighting bad bots too aggressively you might prevent good bots from doing their job and end up with indexing problems.

First of all, let's understand how bots work.

There are two main ways bots work:

  1. By simple HTTP requests.

  2. Through a browser (including Selenium, headless browsers, normal browsers, and stealth browsers that hide their automation fingerprint).

Using these two mechanisms, bots perform actions on or extract data from your website. Unlike humans, they work very quickly and can run constantly, and their work can be multiplied across several processes and threads. The sequence of actions a bot performs has to be defined in advance.
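As an illustration of the first mechanism, a request-based bot is often little more than the sketch below (the URL and header values are placeholders, not taken from any real site):

```typescript
// Minimal sketch of mechanism 1: a bot driving a site with plain HTTP requests.
// The URL and header values are placeholders for illustration only.
async function scrapePage(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: {
      // A careless bot may omit these or send dummy values,
      // which is exactly what the detection tips below look for.
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
      "Referer": "https://example.com/",
    },
  });
  return response.text(); // raw HTML, parsed later with selectors
}

scrapePage("https://example.com/products?page=1").then((html) =>
  console.log(html.length)
);
```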

No single technique can totally guarantee protection from bots, but there is a set of techniques that, applied together, significantly reduces bot traffic.

Always keep in mind that whatever you do, bot developers might find a way to bypass it. The harder your website is to exploit, the fewer developers will try to mess with it, and eventually many will give up running bots against your site.

Here are those techniques.


Add robots.txt

First of all, always include a robots.txt file on your website and define rules for bots. Some bots check robots.txt and respect the rules. Bots look for the file at the root of your domain (https://yourdomain.com/robots.txt), so place it there.

Read more about the robots exclusion standard here.
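If your site runs on a Node/Express back end, serving the file can be as simple as the sketch below. The rules themselves are only examples; adjust them to your own site structure, and remember that bad bots will simply ignore them.

```typescript
import express from "express";

const app = express();

// Serve robots.txt at the root of the domain, where crawlers look for it.
// Cooperative crawlers honor these rules; bad bots ignore them entirely.
app.get("/robots.txt", (_req, res) => {
  res.type("text/plain").send(
    [
      "User-agent: *",          // applies to every crawler
      "Disallow: /admin/",      // keep private areas out of crawls
      "Disallow: /search",      // avoid crawl traps on search pages
      "Sitemap: https://example.com/sitemap.xml",
    ].join("\n")
  );
});

app.listen(3000);
```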

Check request headers.

In many cases bots use simple HTTP requests to perform actions on your website. By checking the HTTP request headers, it is often possible to detect such a bot.

  1. Always check the User-Agent header of the HTTP request and validate it. If it is absent, you can be fairly sure the request comes from a bot. Even if a user agent is present, still validate it: some bot developers are too lazy to send a realistic user agent and instead send a dummy string.

  2. Check the Referer header; a bot often sends an empty one.

  3. You can set and check cookies. Bots that work with simple HTTP requests usually do not send cookies back.

           If bots work with Selenium or headless browsers, you cannot detect them with these methods.
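To make the three checks concrete, here is a minimal Express middleware sketch. The browser regex and cookie name are illustrative assumptions, not a complete ruleset:

```typescript
import express, { Request, Response, NextFunction } from "express";
import cookieParser from "cookie-parser";

const app = express();
app.use(cookieParser());

// Very rough header checks; real traffic needs a more careful ruleset.
function basicBotCheck(req: Request, res: Response, next: NextFunction) {
  const userAgent = req.get("User-Agent");

  // 1. Missing user agent: almost certainly not a real browser.
  if (!userAgent) {
    return res.status(403).send("Forbidden");
  }

  // Crude validation: real browsers identify a known engine.
  const looksLikeBrowser = /Mozilla|Chrome|Safari|Firefox|Edge/i.test(userAgent);
  if (!looksLikeBrowser) {
    return res.status(403).send("Forbidden");
  }

  // 2. Empty Referer on pages normally reached via links is suspicious.
  // 3. A returning visitor should send back the cookie we set earlier.
  const hasReferer = Boolean(req.get("Referer"));
  const hasSessionCookie = Boolean(req.cookies["session_marker"]); // illustrative name
  if (!hasReferer && !hasSessionCookie) {
    // Do not block outright; flag for further checks (rate limits, captcha, ...).
    console.warn(`Suspicious request from ${req.ip}`);
  }

  // Set the marker cookie so we can check it on later requests.
  res.cookie("session_marker", "1", { httpOnly: true });
  next();
}

app.use(basicBotCheck);
```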


Monitor web traffic by IP

Check the frequency of requests per IP. Bots usually send more requests in a given period of time than humans do, they have a very low bounce rate (they stay on a page for a very short time), and they often keep sending requests for long stretches. By monitoring traffic per IP you can catch a bot that does not rotate its IP address.
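As an illustration, here is a naive in-memory rate limiter per IP for an Express app. The window and threshold are arbitrary examples, and a real deployment behind a load balancer would use a shared store such as Redis:

```typescript
import { Request, Response, NextFunction } from "express";

// Naive in-memory rate limiter per IP; thresholds are illustrative.
const WINDOW_MS = 60_000;   // 1-minute window
const MAX_REQUESTS = 100;   // allowed requests per IP per window

const hits = new Map<string, { count: number; windowStart: number }>();

export function rateLimitByIp(req: Request, res: Response, next: NextFunction) {
  const ip = req.ip ?? "unknown";
  const now = Date.now();
  const entry = hits.get(ip);

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    // Start a new counting window for this IP.
    hits.set(ip, { count: 1, windowStart: now });
    return next();
  }

  entry.count += 1;
  if (entry.count > MAX_REQUESTS) {
    // Too many requests in the window: likely a bot that keeps one IP.
    return res.status(429).send("Too many requests");
  }
  next();
}
```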

Use honeypots

Use honeypots to trap bots. To reach the pages they want, bots often blindly follow navigation links. With the honeypot method, you add a fake navigation link but hide it from real users with CSS or JavaScript. A bot will follow the URL and visit that page, and you can flag whoever lands there. Just make sure search engine bots will not follow the link, for example by marking it nofollow and the trap page noindex. Check this article for more details.
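A minimal sketch of the honeypot idea with Express follows; the trap path, markup and flag store are only examples:

```typescript
import express from "express";

const app = express();
const flaggedIps = new Set<string>();

// A navigation link that real users never see (hidden via CSS),
// pointing at a trap path that only bots will follow.
app.get("/", (_req, res) => {
  res.send(`
    <nav>
      <a href="/products">Products</a>
      <!-- honeypot: hidden from humans, nofollow so good crawlers skip it -->
      <a href="/internal-directory" rel="nofollow" style="display:none">Directory</a>
    </nav>
  `);
});

// Anything that requests the trap URL gets flagged as a bot.
app.get("/internal-directory", (req, res) => {
  flaggedIps.add(req.ip ?? "unknown");
  // Return something bland instead of admitting the trap
  // (see the later tip about not revealing the real reason).
  res.setHeader("X-Robots-Tag", "noindex, nofollow");
  res.status(404).send("Not found");
});

app.listen(3000);
```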

Make your website and content unpredictable for bots.

Make your website, its content and its structure as unpredictable as possible. This will make many bots fail. When a bot developer designs a workflow to be converted into code, they follow standard, predictable patterns. A bot built around those patterns may fail when something unpredictable happens.

  1. Change your website selectors frequently. Selectors play a very important role for bots: they are what bots use to perform automated tasks on websites or to parse HTML content. If you change selectors, bots may fail to do their job.

  2. Have multiple selectors for the same content and rotate them randomly. For example, when displaying a list of products you might sometimes set their class name to product, other times to item, and so on.

  3. If you return data in JSON or XML format, change the structure frequently, or keep several structures and rotate them randomly.

          This unpredictability drives bot developers away, and very few will mess with your website a second time.
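As a sketch of the selector-rotation idea (the class names and markup are only examples, assuming server-side rendering in a Node/TypeScript stack):

```typescript
// Rotate between equivalent class names for the same content so that
// bots relying on a fixed selector break. Names are illustrative.
const PRODUCT_CLASS_NAMES = ["product", "item", "listing-entry"];

function pickClassName(): string {
  return PRODUCT_CLASS_NAMES[Math.floor(Math.random() * PRODUCT_CLASS_NAMES.length)];
}

interface Product {
  name: string;
  price: number;
}

// Server-side rendering sketch: all class names map to the same CSS, so
// users see an identical page, but a scraper that hard-coded ".product"
// will miss the data on many responses.
function renderProductList(products: Product[]): string {
  const className = pickClassName();
  const items = products
    .map((p) => `<li class="${className}">${p.name} - $${p.price}</li>`)
    .join("\n");
  return `<ul class="${className}-list">${items}</ul>`;
}

console.log(renderProductList([{ name: "Widget", price: 9.99 }]));
```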


Block requests coming through proxies

If you detect bots and restrict their activity by IP, bot developers will use proxies or rotating proxies to bypass the restriction. It is not easy to tell whether a request comes from a proxy, but here are some tips:

  1. Check whether the client IP belongs to a data center or hosting provider; most proxies run on hosting services, so their IPs are not residential. This method has downsides: legitimate clients using a VPN may also be blocked or flagged.

    The second problem is that a bot can use residential proxies to bypass this check, although few do, because residential proxies are very expensive.

  2. You may be able to detect proxies from request headers: some proxy services add headers such as Via, Forwarded, X-Forwarded-For, Client-IP and so on; please check this article.

  3. You can also check whether the IP has been blacklisted; many proxy IPs already appear on blacklists, and third-party services can look this up for you.
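Here is a rough sketch of how such signals could be combined, assuming an Express back end. The data-center/blacklist lookup is left as a placeholder, since it depends on which IP-intelligence source you use:

```typescript
import { Request } from "express";

// Headers that some (not all) proxy services add; their absence proves nothing,
// and legitimate infrastructure (CDNs, corporate gateways) sets some of them too.
const PROXY_HEADERS = ["via", "forwarded", "x-forwarded-for", "client-ip"];

function hasProxyHeaders(req: Request): boolean {
  return PROXY_HEADERS.some((name) => Boolean(req.get(name)));
}

// Hypothetical lookup against data-center IP ranges or a blacklist;
// the real implementation depends on your IP-intelligence provider.
async function isDataCenterIp(ip: string): Promise<boolean> {
  // e.g. query an IP-reputation API or a local database of hosting ranges
  return false; // placeholder
}

export async function proxyScore(req: Request): Promise<number> {
  let score = 0;
  if (hasProxyHeaders(req)) score += 1;
  if (await isDataCenterIp(req.ip ?? "")) score += 2;
  return score; // use as one signal among many, not a hard block
}
```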


Disallow requests from Selenium

When you make access difficult for bots that use simple HTTP requests, many of them will switch to Selenium or headless browsers to bypass your checks.

So you should check whether the request is coming from Selenium or from a normal browser.

  1. Check whether the request is coming from a headless browser. Here are some good articles on it. Also, many headless browsers include a marker such as HeadlessChrome in their user agent.

  2. Check whether requests come from Selenium. Selenium-driven browsers usually expose variables prefixed with $cdc_ or $wdc_. Here is a good article on it.
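For illustration, here is a browser-side sketch of these checks in TypeScript. The exact fingerprint keys vary by driver and version, so treat them as heuristics, and the reporting endpoint name is made up:

```typescript
// Browser-side sketch of common automation checks.
function looksAutomated(): boolean {
  // Standard flag set by most automation tools (Selenium, Puppeteer, ...).
  if (navigator.webdriver) {
    return true;
  }

  // Headless Chrome advertises itself in the user agent.
  if (/HeadlessChrome/i.test(navigator.userAgent)) {
    return true;
  }

  // ChromeDriver historically injects properties prefixed with $cdc_ / $wdc_.
  const suspiciousKey = (key: string) => /^\$cdc_|^\$wdc_/.test(key);
  if (
    Object.getOwnPropertyNames(document).some(suspiciousKey) ||
    Object.getOwnPropertyNames(window).some(suspiciousKey)
  ) {
    return true;
  }

  return false;
}

// Report the result so the back end can decide what to do.
if (looksAutomated()) {
  void fetch("/api/flag-client", { method: "POST" }); // endpoint name is illustrative
}
```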

If you block a bot, do not provide a real reason.   

If you detect that a request comes from a bot, do not return a 429 error with an explanatory message. Instead, return dummy data, some other harmless response, or a different error such as a 500. If bot developers learn that their bot has been detected, and why, they will look for ways to bypass the detection.
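For example (a sketch with made-up response shapes), instead of an honest 429 you might answer like this:

```typescript
import { Request, Response } from "express";

// When a request is flagged as a bot, answer with something unremarkable
// instead of an honest 429 + explanation. Shapes and status codes are
// illustrative; the point is to avoid confirming the detection.
function respondToSuspectedBot(_req: Request, res: Response) {
  if (Math.random() < 0.5) {
    // Plausible-looking but useless data.
    res.json({ results: [], total: 0 });
  } else {
    // Or a generic error that looks like an ordinary outage.
    res.status(500).send("Internal Server Error");
  }
}
```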

Encode useful information

Encode useful information on the back end and decode it on the front end. In many cases bots are after your data, and encoding it makes the bot developer's job much harder. For example, if you return data in JSON format, you can encode it in your back end and decode it in the front end with JavaScript; you can do the same with any other valuable data bots might look for. This causes no problem for end users, who will see the real decoded data, while a bot will only see the encoded data.
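A minimal sketch of the idea, assuming an Express back end and base64 as a deliberately simple stand-in encoding; any reversible encoding you control works the same way:

```typescript
// Back end (Node): encode the payload before returning it.
import express from "express";

const app = express();

app.get("/api/products", (_req, res) => {
  const data = [{ name: "Widget", price: 9.99 }];
  // Base64 here is only light obfuscation, used to illustrate the idea.
  const encoded = Buffer.from(JSON.stringify(data)).toString("base64");
  res.json({ payload: encoded });
});

app.listen(3000);

// Front end (browser): decode before rendering, so real users see normal
// data while a scraper grabbing the raw response only sees the blob:
//
//   const res = await fetch("/api/products");
//   const { payload } = await res.json();
//   const products = JSON.parse(atob(payload));
```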

Make finding your dataset difficult

If your website displays datasets, bots will most likely try to scrape the data in bulk.

The way to prevent this is to make search and dataset discovery difficult for bots, though not for normal users, since the way normal users search and the way bots search are different.

  1. Do not expose your main content through easily enumerable URLs like www.domain.com/company?id=1, www.domain.com/company?id=2 and so on. Bots will simply increment the id and collect all the data.

  2. If you display results through search, do not allow searches with wildcard symbols like % or with just 1 or 2 letters or digits, unless it is really needed.

  3. Do not return all data at once; add pagination and a limit in your back end and return at most, say, 20 results per page.
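A minimal sketch of such a search endpoint, assuming Express; the limits, field names and data-access helper are illustrative:

```typescript
import express, { Request, Response } from "express";

const app = express();
const MAX_PAGE_SIZE = 20; // hard cap enforced on the back end

// Hypothetical search endpoint illustrating the rules above.
app.get("/api/search", (req: Request, res: Response) => {
  const query = String(req.query.q ?? "").trim();

  // Reject wildcard-style and too-short queries that would dump the dataset.
  if (query.length < 3 || /[%*_]/.test(query)) {
    return res.status(400).json({ error: "Query too broad" });
  }

  // Never return everything at once: paginate with a capped page size.
  const page = Math.max(1, Number(req.query.page) || 1);
  const limit = Math.min(MAX_PAGE_SIZE, Number(req.query.limit) || MAX_PAGE_SIZE);

  const results = searchDatabase(query, page, limit); // assumed data-access helper
  res.json({ page, limit, results });
});

// Placeholder for whatever data layer you actually use.
function searchDatabase(query: string, page: number, limit: number): unknown[] {
  return [];
}

app.listen(3000);
```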

Use captchas 

Use captchas when you notice unusual or suspicious activity, or place them where bots can exploit your website, such as on submission forms or search pages.

Captchas should be used strategically:

  1. Verify the captcha on the server side as well: if you place a captcha, do not accept the request on the server until the captcha has been solved. Many sites put a captcha in the front end but never check on the back end whether it was actually solved, which makes it very easy to manipulate, disable or remove the captcha from the front end.

  2. Remember that captchas can be solved by third-party services. Fill-in-the-letters and other primitive captchas are easy to bypass this way, so use harder captchas such as Google's reCAPTCHA, which are much harder to defeat even with solving services.
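As a sketch of the server-side verification from point 1, here is an Express handler that checks a reCAPTCHA v2 token against Google's siteverify endpoint before processing a form; the route and the surrounding form handling are illustrative:

```typescript
import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false }));

// Server-side reCAPTCHA verification: never trust the front end alone.
// RECAPTCHA_SECRET is your secret key; "g-recaptcha-response" is the
// default form field reCAPTCHA v2 submits.
const RECAPTCHA_SECRET = process.env.RECAPTCHA_SECRET ?? "";

app.post("/contact", async (req, res) => {
  const token = req.body["g-recaptcha-response"];
  if (!token) {
    return res.status(400).send("Captcha required");
  }

  // Ask Google whether this captcha token was really solved.
  const verification = await fetch(
    "https://www.google.com/recaptcha/api/siteverify",
    {
      method: "POST",
      body: new URLSearchParams({ secret: RECAPTCHA_SECRET, response: token }),
    }
  );
  const result = (await verification.json()) as { success: boolean };

  if (!result.success) {
    return res.status(400).send("Captcha failed");
  }

  // Only now process the actual form submission.
  res.send("Thanks for your message!");
});

app.listen(3000);
```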

Use third-party services to prevent bots      

With the steps above you can fight bots yourself, but keep in mind there are also services that can handle bot prevention for you.

  1. Imperva (formerly Distil Networks) is one of the market leaders in bot detection and prevention.

  2. Cloudflare provides such a service. 

  3. Google's reCAPTCHA Enterprise.


There is no doubt that avoiding bots on the web is a challenge. However, there are ways to limit your exposure to them, as long as you are careful not to harm the user experience too much. By checking request headers, monitoring web traffic by IP, using captchas and applying the other techniques mentioned above, you will be much closer to protecting your website from bots and keeping your traffic genuine.



