Anubis & Obscurity
You may have come across Anubis over the last few months - a WAF, of sorts, that uses automated challenges to filter out misbehaving web scrapers.
It's kinda crap. But that's fine & it's also kinda cool!
Why it is required
There are plenty of "bots" out there & they aren't all bad. A Mastodon instance checking verified links on your profile? That's a bot. A social media tool generating previews for a link? That's a bot. The Internet Archive? That's a bot.
Web clients accessing a resource will generally indicate their identity using a User-Agent header.
Browser user agents are... weird.
What's relevant for us here is that browser user agent strings generally start with Mozilla/5.0.
Bot user agents should generally be pretty simple, by comparison: botname/version (some additional info).
Additionally, bots should follow the robots.txt if it exists: a file hosted at a well-known location on a webserver that tells bots what they may or may not access. This can be pretty granular, including different rules for different bots, request frequency limits, ...
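To make that concrete, a robots.txt exercising those options might look like this (the bot names here are made up, and directives like Crawl-delay are non-standard extensions that not every crawler honours):

```
# A well-behaved archival bot may fetch everything
User-agent: NiceArchiveBot
Disallow:

# A hypothetical scraper is kept out of the expensive endpoints
User-agent: ExampleScraper
Disallow: /search/
Disallow: /api/
Crawl-delay: 10

# Everyone else: stay out of the admin area
User-agent: *
Disallow: /admin/
```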
... except all of that is just convention, there is no enforcement. Bots can just ignore the robots.txt. Clients (incl. bots) can set any user agent they want & many of the more misbehaving bots do - a popular choice being to imitate a common browser.
How Anubis attempts to solve this
This is just a high-level description of Anubis, I will skip elements of it if I don't think they are relevant to this discussion. If you want all the details: Here are the docs
Anubis, by default, only attempts to filter out extremely misbehaving bots: those that impersonate (or are!) browsers. For this, Anubis checks the User-Agent header - if it doesn't start with (or contain) Mozilla, you get a free pass. This feels quite paradoxical at first, as Anubis explicitly doesn't block bots presenting as bots.
After that, Anubis issues a challenge: a small piece of JS solving a mathematical problem that is expensive to compute but cheap to verify. Once the client sends a correct solution, Anubis issues a signed cookie which the client can use to bypass future inspections.
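The general shape of such a proof-of-work check, as a minimal sketch in Python (this is the generic hash-prefix idea, not Anubis's actual implementation, parameters, or challenge format):

```python
import hashlib
import secrets

DIFFICULTY = 4  # require this many leading zero hex characters

def make_challenge() -> str:
    # A random string the client has to grind a nonce against.
    return secrets.token_hex(16)

def solution_ok(challenge: str, nonce: int) -> bool:
    # Verification is a single hash, however long the client searched.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)
```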
That's it.
Why this is kinda crappy
Trivial bypass
Ok, you are a misbehaving bot, pretending to be a browser. You approach the webserver, get told "lol no"... what do you do? Well, just make the request again, but this time with a user agent that doesn't look like a browser. Bam, you are in.
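In Python that bypass is about two lines (hypothetical URL, and assuming the default behaviour of only challenging browser-like user agents):

```python
import requests

browser_ua = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
bot_ua = {"User-Agent": "definitely-not-a-browser/1.0"}

# Pretending to be a browser gets us the challenge page...
challenged = requests.get("https://example.org/some/page", headers=browser_ua)

# ...pretending to be a bot gets us the content straight away.
content = requests.get("https://example.org/some/page", headers=bot_ua)
```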
Work imbalance
Anubis is (largely) stateless, from what I can tell. This is kinda neat from a lot of perspectives - all information is maintained in the signed token, no database required on the server. But this also means: tokens are valid for a certain period of time, not for a number of requests.
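A stateless token along those lines can be as simple as an HMAC over an expiry timestamp - everything needed to check it later is inside the token itself (again a generic sketch, not Anubis's actual cookie format):

```python
import hashlib
import hmac
import time

SECRET = b"server-side signing key"  # hypothetical; never leaves the server

def make_token(valid_seconds: int = 7 * 24 * 3600) -> str:
    expiry = str(int(time.time()) + valid_seconds)
    sig = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
    return f"{expiry}.{sig}"

def token_ok(token: str) -> bool:
    expiry, sep, sig = token.rpartition(".")
    if not sep or not expiry.isdigit():
        return False
    expected = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expiry) > time.time()
```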
Why does this matter? Should any bot obtain a valid token they can scrape the entire website without further checks.
This creates an interesting imbalance: legitimate users (maybe visiting a handful of pages) pay a much higher proof-of-work cost per request than bots, and the larger the site being scraped, the further that per-page cost shrinks. If solving a challenge takes a second, a visitor reading five pages pays 200 ms per page, while a scraper pulling a million pages on the same token pays about a microsecond per page.
Oh, and from some very ad-hoc testing, the in-browser JS code is pretty slow compared to a hacked-together Python script.
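For comparison, the nonce search itself is a handful of lines of Python, with the hashing done in native code (same hash-prefix assumption as the sketch above, not Anubis's exact scheme):

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int = 4) -> int:
    # Grind nonces until the hash has the required number of leading zeros.
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
```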
But does this matter?
And yet Anubis is fine - it will keep out the dumb & malicious bots at a very low cost. Implementing a challenge solver takes effort. Solving challenges adds up over time if you are scraping billions of websites. Even implementing Anubis detection & a user-agent-based bypass takes effort. And if you really want to bring down the hammer, you can disable the user agent check. And all that for a few CPU cycles on the server & the user's end - most of the time I don't even notice the Anubis challenge pages anymore. Oh, and maybe breaking the Internet Archive bot...
But Anubis cannot become successful, it must stay a niche application. Should it ever become successful, "attackers" will adapt and exploit those weaknesses... because this is Security by Obscurity. Anubis works by being weird, unexpected, and small. The proof-of-work is secondary. I am not going to go as far as calling it theatre, I don't have the data to prove that. But no motivated attacker will be stopped by it.
Hm, actually, good question, how effective is Anubis compared to a simple "here is a cookie, now do a refresh" JS starting page?...
Obscurity is fine
And again, that is fine. Obscurity can absolutely be part of a defensive strategy, reducing load, reducing noise, keeping out the dumb-as-rocks attackers so you have time & resources to focus on the ones that matter. Be that in security operations or in the number of CPU cycles you have available on your webserver.
Offtopic
Anyway, this was all a very long-winded way of setting up an explanation of why my employer should move away from firstname.lastname@domain.tld for employee email addresses, because jfc, if someone in middle management asks me ONE MORE TIME how spammers know their email address & I need to ask them "do you have LinkedIn"...
How about firstname.lastname.random3digitnumber@domain.tld?
Who do you think will kill me first? Messaging team or business?