Ideas for Proof of Work crawler protection

A while ago I wrote a post about Anubis in which I called it kinda crap but fine. Of course my brain didn't shut off after that so, uh, here are some thoughts on how one could solve some of the issues of the proof-of-work approach Anubis uses. I am sure there is prior literature & I am just reinventing the wheel, but I wanna see how far I can get on my own (with the design for now, I can't program for shit...).

In the previous blogpost I highlighted the work imbalance in the approach Anubis uses - provide a proof-of-work solution once, get a cookie, do all the requests you want within a multi-day time period. Or in other words: the required work does not scale with the number of requests. If you intend to keep the site usable for flesh-and-blood users, crawlers don't really pay a high cost either. Unfortunately you can't really limit the number of calls per token either (without going stateful, ew)... so what if we just required constant proofs instead?

Basic design

Let's say a client is requesting $url. To access that resource the client also needs to provide $solution such that $solution != mod($url) & hash($solution)[0:n] = hash(mod($url))[0:n], where n is a measure of difficulty, hash is some hash function, and mod is a modifying function for the URL - what exactly mod needs to do depends on how we solve some issues below.
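To make the condition concrete, here is a minimal sketch. SHA-256, the hex-prefix comparison and the difficulty of 5 are arbitrary choices of mine, and mod() is just a placeholder until we decide what it actually needs to strip:

```python
import hashlib

N = 5  # difficulty: number of leading hex characters that must match (assumption)

def mod(url: str) -> str:
    # Placeholder; what this strips/normalizes is decided further below.
    return url

def hex_hash(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def verify(url: str, solution: str) -> bool:
    # Gateway side: check the two conditions from the text.
    target = mod(url)
    return solution != target and hex_hash(solution)[:N] == hex_hash(target)[:N]

def solve(url: str) -> str:
    # Client side: brute-force any string (other than mod(url) itself) whose
    # hash shares the first N hex chars with hash(mod(url)).
    # Expected cost is on the order of 16**N hash evaluations.
    prefix = hex_hash(mod(url))[:N]
    counter = 0
    while True:
        candidate = f"{mod(url)}#{counter}"
        if candidate != mod(url) and hex_hash(candidate)[:N] == prefix:
            return candidate
        counter += 1
```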

How do we get $solution to the gateway? A cookie might be possible but seems messy, same for the body (oh god, I don't want to think about the packing/unpacking magic required for protecting POSTs...). I see two options: headers and query parameters.

Headers would be kinda neat because they don't require modifying the URL. You do however run into the issue that providing pre-validated links on other websites requires JS. This probably isn't a KO criterion, though it will require some extra engineering for requests with a body coming from other websites.

Adding $solution as a URI query parameter feels a bit messy at first. First, it is user-visible, which... eh. Second, it either requires removal of the parameter in mod, or $solution becomes self-referential - this shouldn't technically be an issue but... is kind of a mindfuck. But it has one major benefit: it doesn't require messing with the actual requests per se; you can just link to $url[?/&]solution=$solution directly. It's probably the approach I'd prefer.
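A rough sketch of that variant, reusing solve() from the sketch above; the parameter name "solution" and the helper names are made up, and mod() now actually strips the parameter to avoid the self-reference:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def mod(url: str) -> str:
    # Compute the proof over the URL without the "solution" parameter.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "solution"]
    return urlunsplit(parts._replace(query=urlencode(query)))

def prevalidated_link(url: str) -> str:
    # Build a $url[?/&]solution=$solution link you could publish elsewhere.
    parts = urlsplit(mod(url))
    query = parse_qsl(parts.query) + [("solution", solve(url))]
    return urlunsplit(parts._replace(query=urlencode(query)))
```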

Restoring Usability

Eh, so, all of that is nice, but we just took usability out back and put a bullet through its head, in two different ways - maybe more I haven't found yet: First, loading a site with subresources (idk, JS, CSS, embedded images, ...) either isn't gonna work or is gonna cost a lot of extra work. Second, the cute anime Egyptian god will pop up & make you wait every time you click a link.

Subresources

We can solve this in multiple ways.

The first one is kinda lazy: only require proof-of-work for specific resources, making the proof-of-work for everything else implicit. Wanna access something in /assets/? Sure, go ahead. Wanna access something in /posts/? You better do the homework. The idea here is that there are resources that are either not very protection-worthy (e.g. the CSS for a generally available SSG preset) or only accessible through protected resources (e.g. images embedded in or linked to by protected pages). Important note though: this does not provide any protection against crawlers attempting to enumerate directory contents in unprotected directories directly.
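On the gateway that could look something like this; the prefixes are made up and verify() is the one from the first sketch:

```python
UNPROTECTED_PREFIXES = ("/assets/", "/static/", "/favicon.ico")  # illustrative only

def requires_proof(path: str) -> bool:
    return not path.startswith(UNPROTECTED_PREFIXES)

def handle(path: str, solution: str | None) -> int:
    # Returns an HTTP-ish status: serve directly, or demand the homework.
    if not requires_proof(path):
        return 200
    if solution is not None and verify(path, solution):
        return 200
    return 403
```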

The second one follows the same idea but attempts to protect against enumeration: solutions for subresources are pregenerated on the server & already included in the HTML/JS delivered to the client. This... is messy and may completely fall apart if you are using external JS or whatever... Not a fan.
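For illustration, a toy version of that rewriting step, using prevalidated_link() from earlier; a real implementation would need an actual HTML parser rather than this regex:

```python
import re

def embed_solutions(html: str) -> str:
    # Rewrite simple src="..."/href="..." attributes so every subresource
    # URL already carries its pregenerated solution.
    def rewrite(match: re.Match) -> str:
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{prevalidated_link(url)}"'
    return re.sub(r'(src|href)="([^"]+)"', rewrite, html)
```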

The third option... is kinda my favourite but also infeasible in most cases: if you have a good mapping of which resource is used by which page, we could use the Referer header for implicit proof. Sounds like a lot of work and may again allow for certain enumeration-based bypasses.
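The mapping is the painful part; the check itself would be trivial (entries made up):

```python
# resource path -> set of pages known to embed it (hypothetical data)
EMBEDDED_BY = {
    "/images/diagram.png": {"/posts/pow-ideas"},
}

def implicit_proof_ok(path: str, referer_path: str | None) -> bool:
    # Accept the request if the Referer points at a page known to embed it.
    return referer_path in EMBEDDED_BY.get(path, set())
```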

Honestly, let's just go with the first one.

Conclusion

I don't think I can implement this. I have no experience with webshit... at all, so I am probably overlooking shit. If you find some issues, please let me know though - I'd love to learn what I failed to consider!