Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis and how it has gotten much worse in the last few months.
I’ll admit that I am a newbie, so I ask in ignorance: have you tried using Anubis + BadBotBlocker + Fail2Ban?
It genuinely worked wonders for my tiny site that was being bombarded.
Crawlers have never been a problem for me, as my internet subscription is unlimited. My experience comes from being the one crawling sites and bypassing the CF challenge.