Neatnik

Gotta block ’em all

Earlier today I announced that omg.lol will block all AI-related bots, and I put those changes into production on the load balancer itself. Thousands of requests have already been blocked in a relatively short period. Mission accomplished?
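
For anyone curious what that kind of blocking looks like in practice, here’s a minimal sketch of the idea in Python (not our actual load balancer configuration, and the bot list is just a sampling of publicly documented AI crawlers):

# A few publicly documented AI crawler tokens (illustrative, not exhaustive).
AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "Bytespider")

def is_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)

print(is_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # True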

Not quite. As Robb recently discovered, we can’t rely solely on user agents to identify problematic HTTP requests. The thing about a user agent header is that it’s entirely voluntary: you don’t even have to send one at all to make a web request. Or, you can set its value to anything you’d like. The AI companies know that we’re fending them off at the user agent level, and all they have to do to circumvent our efforts is change their user agent to something else. It’s depressingly easy for them to cheat.
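
To illustrate just how low that bar is, here’s all it takes (a hypothetical example using Python’s requests library) to make a request that claims to be an ordinary desktop browser:

import requests

# Nothing validates this header; a scraper can claim to be any browser it likes.
spoofed_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/125.0"}
response = requests.get("https://example.com/", headers=spoofed_headers)
print(response.status_code)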

And when they do, they win. They scrape and slurp, ingest and train, monetize and B2B and SaaS their solutions to their investors’ delight—all powered by our data. We may think that we’re going to come out on top with a robots.txt file or a cleverly configured web server, but all we’re doing is keeping a few honest players away (for now, at least). Meanwhile, the teeming masses of dishonest players are feasting on our stuff while we’ve lulled ourselves into a somewhat false sense of security.

Case in point: I was reviewing omg.lol logs earlier, and I found 30 requests made in quick succession, all from the IP address 23.26.220.22. They were all requests for omg.lol member profile pages. They all had this user agent:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)

I did a whois on the IP address, and found that it belongs to some generic data center company (“ACE Data Centers, Inc.”). I flipped back to my logs for an updated look, and those 30 requests had leapt to nearly 600 from the same address. Hmm.
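
Surfacing that kind of pattern doesn’t take anything fancy. Here’s a rough sketch of the general idea (not the exact commands I used), assuming a standard combined-format access log: count requests per client IP, then hand the noisy ones to whois.

import re
from collections import Counter

# Tally requests per client IP; the IP is the first field in a combined-format log line.
counts = Counter()
with open("access.log") as log:
    for line in log:
        match = re.match(r"(\S+)", line)
        if match:
            counts[match.group(1)] += 1

for ip, total in counts.most_common(10):
    print(f"{total:>6}  {ip}")  # anything suspicious here gets a follow-up whois lookup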

The thing about that user agent is that it looks like a regular browser (Firefox) on a regular computer (running Windows). It’s a little outdated (Firefox 3.5.5 was released in 2009, and the referenced OS is Windows 7), but there’s no law against running older software. Still, let’s be real: those 600 web requests, for 600 unique resources, made in a relatively short span of time, did not come from someone’s Windows machine running an outdated version of Firefox. It’s a scraper of unknown origin running out of a data center somewhere, it’s been active for months, and we have absolutely no idea what it’s doing with the data that it collects.

This is the way the open web works: we put stuff out there, and it’s free for the taking. Only recently have we become more concerned with who’s doing the taking, and we’re trying to take steps to put some controls around that. But the example I shared above happens over and over, day after day, right under our noses. We can see it clearly in hindsight, but by then it’s too late—our content has been taken.

I just paused for a moment to take a fresh look at omg.lol logs, and found 279 new requests that originated from 44.206.236.122 and spanned roughly two minutes (so, about 2.3 requests per second). They, too, were all for omg.lol member profile pages. These requests used two different user agents:

Mozilla/5.0 (Linux; Android 13; Pixel 6 Pro) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246

As before, these are identifying as regular browsers on regular devices—not bots or other automations. But the IP address belongs to Amazon EC2 (a cloud services provider), and the nature of the requests—loading member web pages, one after another, in rapid succession—shows clear signs of automation. So who’s doing this, and what are they doing with the content they’re retrieving? Could they be selling it to the companies that we’re blocking? Who knows.
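
That rapid-fire pattern is itself a usable signal, and it’s the sort of heuristic I wish more of our tooling handled for us. Here’s a rough sketch of the idea, again assuming a combined-format access log and with a threshold pulled out of thin air:

import re
from collections import defaultdict
from datetime import datetime

# Flag IPs whose sustained request rate looks automated rather than human.
LOG_LINE = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")
hits = defaultdict(list)

with open("access.log") as log:
    for line in log:
        match = LOG_LINE.match(line)
        if match:
            ip, stamp = match.groups()
            hits[ip].append(datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"))

for ip, times in hits.items():
    span = (max(times) - min(times)).total_seconds() or 1  # avoid dividing by zero
    if len(times) > 50 and len(times) / span > 1:  # 50+ requests at over 1/second
        print(f"{ip}: {len(times)} requests in {span:.0f}s")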

There’s really no good solution here. This is the open web functioning as it was intended: we put stuff out there, and anyone can help themselves to it. The open web wasn’t designed with stanchions and security guards in mind. It was designed for use by people, and the loose and inconsistent robots.txt specification shows just how much of an afterthought managing non-human interaction with our content has always been. And here we are, mid-2024, in the middle of a raging AI hype cycle, faced with a need to do some very precise management of non-human visitors to our websites. But we can’t, because the modern web is built around serving content efficiently, not controlling how it’s accessed or by whom.

If we’re going to tackle this problem effectively, we’re going to need to put a lot of thought and effort into different solutions. We need smarter web servers that can use heuristic techniques to differentiate human visitors from machines. We need better IP blocklists and easier ways to share and import them into our web stacks. Nobody wants to see CAPTCHAs acting as the nightclub bouncer to our content, but maybe we need an updated and modern technique for rapidly and accessibly proving humanity. We’re gonna need a whole bunch of stuff if we’re ever going to move beyond the cheap bicycle lock of user agent filtering.
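
The building blocks for some of this already exist; the hard parts are curation, sharing, and trust. Just as a sketch, checking an incoming address against a shared list of blocked ranges takes only a few lines (the CIDR ranges below are documentation placeholders, not a real blocklist):

import ipaddress

# Placeholder ranges standing in for a shared, community-maintained blocklist.
BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    """Return True if the address falls within any blocked network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)

print(is_blocked("192.0.2.17"))   # True
print(is_blocked("203.0.113.9"))  # False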

It’s also entirely possible that this battle is ultimately unwinnable, but that’s a topic for another blog post. For now, I’m going to focus on fighting the good fight.