Irresponsible Web Scraping: Ignoring Robots.txt Risks

Mar 6, 2024

The Rise of Irresponsible Web Scraping

The inception of the internet brought with it a Pandora’s box of data, accessible to anyone with the means to retrieve it. However, with great power comes great responsibility, a mantra that seems to have been disregarded by a growing faction of the online community: web scrapers. At the heart of this ethical dilemma lies the humble `robots.txt` file, a creation birthed from the need to establish a modicum of respect and boundaries in the digital realm.

The Genesis of robots.txt

In the early days of the internet, webmasters faced a burgeoning issue: the unregulated access of bots to their websites. These bots, designed to index the web for search engines, often consumed significant bandwidth and resources, hindering website performance and accessibility for actual human visitors. In response, the `robots.txt` file was invented in 1994 as a gentleman’s agreement, a simple protocol allowing website owners to indicate which parts of their site should not be accessed by bots. This agreement was predicated on mutual respect and the understanding that the digital ecosystem thrived on cooperation and consideration.
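To make the protocol concrete, here is a minimal Python sketch of how a well-behaved bot might consult a site's rules before fetching a page. It uses the standard library's `urllib.robotparser`. The rules themselves, the bot names `ExampleBot` and `OtherBot`, the example.com URLs, and the helper `is_allowed` are all made up for illustration and do not describe any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("OtherBot", "https://example.com/public/page.html"),
        ("OtherBot", "https://example.com/private/data.html"),
        ("ExampleBot", "https://example.com/public/page.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

With these made-up rules, only the first check comes back as allowed. Nothing in the protocol enforces the answer: a bot that never asks the question, or ignores the reply, can fetch the page anyway, which is why the file works only as an agreement.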

The Disregard of Digital Etiquette

Fast forward to the present, and the landscape has dramatically changed. Companies like ByteDance (the parent company of TikTok) stand accused of completely ignoring the agreements that `robots.txt` represents. Armed with vast AWS (Amazon Web Services) subnets, these digital behemoths deploy bots to scrape the internet indiscriminately, in a relentless quest for data. Their actions, devoid of any consideration for the impact on targeted servers, epitomize a new era of digital irresponsibility.

The ramifications of such behavior are profound. Gathering information is free for the scraper, but it can impose substantial costs on the owners of the targeted websites. E-commerce sites are hit particularly hard, as the additional server load can slow a website to a crawl, hurting both sales and user experience. The irony is palpable: in their quest for data, these companies hamper the very commerce they seek to understand and profit from.

The Gold Rush Mentality

The analogy of gold diggers is apt for describing the current state of web scraping. Like miners who extract valuable resources without regard for the environmental or social costs, these digital extractors mine data with no thought to the burden they impose on others. The server owners, much like the land ravaged by gold diggers, are left to bear the cost of their operations. Worse still, they receive no compensation for the data that is extracted, data that is often monetized by the scrapers.

A Call for Responsibility

The issue at hand is not the act of web scraping itself but the manner in which it is conducted. A blatant disregard for `robots.txt` is symptomatic of a broader disrespect for the norms and etiquette that underpin the functioning of the internet. When multiple bots descend upon a website simultaneously, bringing servers to their knees, it’s a clear sign that the balance has been lost.

As we move forward, there is an urgent need for dialogue and regulation. The digital world must find a way to coexist with bots in a manner that respects the rights and resources of all stakeholders. Companies like ByteDance, and others who follow in their footsteps, must reconsider their approach to data gathering. The sustainability of the digital ecosystem depends on it.

The internet was built on principles of openness and cooperation, but these principles are being challenged by the actions of a few. It’s time to reclaim the ethos of mutual respect and consideration that `robots.txt` symbolized, ensuring that the digital gold rush doesn’t come at the expense of the very infrastructure that supports it.

Blocking the ByteDance Scraper with .htaccess

RewriteEngine On
# Check if the User-Agent header contains "bytedance.com"
RewriteCond %{HTTP_USER_AGENT} bytedance\.com [NC]
# Deny access if the above condition is true
RewriteRule ^ - [F,L]

Adding Other Scrapers

Other scrapers can of course be added in the same way. For example, adding the Semrush and Majestic bots to the mix gives the following:

RewriteEngine On
# Check if the User-Agent header contains "bytedance.com" OR "semrush.com" OR "mj12bot.com"
RewriteCond %{HTTP_USER_AGENT} bytedance\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semrush\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mj12bot\.com [NC]
# Deny access if any of the above conditions is true
RewriteRule ^ - [F,L]

Please note that the last RewriteCond line no longer carries the OR flag, because no further condition follows it.

Hosted with us? We are here to help you battle these bots. Simply contact our support department, and we can check whether your site is affected by these practices and act on it.
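Before deploying rules like these, it can help to check which User-Agent strings they would catch. The Python sketch below approximates the matching with the standard `re` module, assuming that `re.IGNORECASE` stands in for the `[NC]` flag and that joining the patterns with `|` stands in for the chained `[OR]` conditions. The helper `would_be_blocked` is hypothetical, and the sample User-Agent strings are illustrative, not an authoritative list of what these bots send.

```python
import re

# The same substrings used in the RewriteCond lines.
BLOCKED_PATTERNS = [r"bytedance\.com", r"semrush\.com", r"mj12bot\.com"]

# One alternation plays the role of the [OR]-chained conditions;
# re.IGNORECASE plays the role of [NC] (assumed equivalent for these patterns).
BLOCKED_RE = re.compile("|".join(BLOCKED_PATTERNS), re.IGNORECASE)


def would_be_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked substring."""
    return BLOCKED_RE.search(user_agent) is not None


if __name__ == "__main__":
    # Illustrative sample strings only.
    samples = [
        "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)",
        "Mozilla/5.0 (compatible; SemrushBot; +http://www.semrush.com/bot.html)",
        "Mozilla/5.0 (compatible; MJ12bot; http://mj12bot.com/)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/123.0",
    ]
    for ua in samples:
        print(would_be_blocked(ua), ua)
```

A check like this only shows what the patterns match. A scraper that sends a different or forged User-Agent header passes straight through, so it is still worth reviewing the server's access logs.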