Almost every website and service is getting scraped at an alarming rate. Are Lemmy servers facing this issue?
Please share mitigations you’ve seen applied to this.
They don’t really need to scrape. They just have to set up their own federated instance and the ActivityPub protocol will willingly hand it all to them in a nicely parsable format.
One link on your website leads to a never-ending labyrinth of nonsense to slowly poison an LLM.
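For anyone curious what such a labyrinth could look like, here is a minimal sketch of the general idea, not the code of any particular tool. The word list, the `tarpit_page` name, and the link layout are all made up for illustration. Each page is generated from its own URL, so nothing has to be stored, and every page links to more generated pages.

```python
import hashlib
import random

# Hypothetical word list; a real tarpit would use something far larger.
WORDS = ["lorem", "quantum", "turnip", "ledger", "osprey", "velvet", "cobalt", "marsh"]


def tarpit_page(path: str, n_words: int = 40, n_links: int = 5) -> str:
    """Return a deterministic page of nonsense for `path` that links to more such pages."""
    # Seed the generator from the path so the same URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)

    # Filler text: random words with no meaning.
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Child links: random hex names under the current path, each of which
    # produces another generated page when requested.
    base = path.rstrip("/")
    links = "".join(
        f'<a href="{base}/{rng.getrandbits(32):08x}">more</a>\n' for _ in range(n_links)
    )
    return f"<html><body><p>{text}</p>\n{links}</body></html>"


if __name__ == "__main__":
    print(tarpit_page("/maze"))
```

You would hook something like this up behind a path that is disallowed in robots.txt, so well-behaved crawlers never see it. Real deployments usually also throttle the responses to waste the crawler's time.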
I use this nginx extension.
slrpnk.net has an AI intercept called Anubis, fwiw
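Tools in this category generally make the visitor's browser solve a small proof-of-work puzzle before the page is served. Below is a rough sketch of that general idea only. It is not Anubis's actual code, and the challenge string, the `solve`/`verify` names, and the difficulty value are assumptions for illustration.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    """Hex SHA-256 of the challenge string followed by the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the first nonce whose hash has `difficulty` leading zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

The asymmetry is the point: solving takes many hashes while checking takes one, so a single human visitor barely notices but a crawler hitting millions of pages pays for each one.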
It’s very easy for any ActivityPub content to be scraped. All servers practically serve the content on a silver platter to any federated server.
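To make the "silver platter" concrete: ActivityPub objects are plain JSON, and a client asks for that form with an `Accept: application/activity+json` header. The sketch below only builds such a request and parses a hand-written sample. The `lemmy.example` URL, the sample object, and the helper names are hypothetical, and some instances may additionally require signed requests.

```python
import json
import urllib.request

ACTIVITY_JSON = "application/activity+json"


def build_request(object_url: str) -> urllib.request.Request:
    """Build (but do not send) a GET asking for the ActivityPub JSON form of an object."""
    return urllib.request.Request(object_url, headers={"Accept": ACTIVITY_JSON})


def extract_fields(raw: str) -> dict:
    """Pull a few common ActivityStreams fields out of a JSON document."""
    obj = json.loads(raw)
    return {key: obj.get(key) for key in ("id", "type", "content")}


# Hypothetical, hand-written sample; not copied from any real server.
SAMPLE = json.dumps({
    "@context": "https://www.w3.org/ns/activitystreams",
    "id": "https://lemmy.example/comment/1",
    "type": "Note",
    "content": "<p>Example comment body</p>",
})

if __name__ == "__main__":
    req = build_request("https://lemmy.example/comment/1")
    print(req.get_method(), req.full_url, req.get_header("Accept"))
    print(extract_fields(SAMPLE))
```

No HTML parsing is involved at any point, which is why scraper-focused defenses on the web frontend do little for federated content.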
We made a post about our actions here
I’m sure the AI devs who are so lazy they cannot train their AI on anything other than scraped HTML can set up a Lemmy instance and point their crawlers at that.