Leaked list shows Facebook training their AI on multiple Lemmy instances

geneva_convenience@lemmy.ml · edited 15 days ago

Leaked list shows Facebook training their AI on multiple Lemmy instances

Rimu@piefed.social · 15 days ago

Check out the robots.txt on any Lemmy instance…

usernamesAreTricky@lemmy.ml · 15 days ago

The linked article in the body suggests that likely wouldn’t have made a difference anyway:

The scrapers ignored common web protocols that site owners use to block automated scraping, including “robots.txt” which is a text file placed on websites aimed at preventing the indexing of context
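For anyone wondering what "respecting robots.txt" looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleAIBot` and `SomeSearchBot` user agents, and the `lemmy.example` URLs are all made up for illustration and are not taken from any real instance. The point is that the check happens entirely in the client, so a scraper that never runs it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch);
# real instances ship their own rules.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /login
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        url = "https://lemmy.example/post/1"
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

A polite crawler would call something like `is_allowed` before each request and skip disallowed URLs. Nothing on the server enforces the result.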

mesa@piefed.social · edited 15 days ago

Yeah, I’ve seen the argument in blog posts that since they are not search engines, they don’t need to respect robots.txt. It’s really stupid.

AmbitiousProcess (they/them)@piefed.social · 15 days ago

“No no guys you don’t understand, robots.txt actually means just search engines, it totally doesn’t imply all automated systems!!!”

belated_frog_pants@beehaw.org · 15 days ago

Scrapers ignore it

Rimu@piefed.social · 15 days ago

Thieves can smash a window to get into my house but I still lock my doors.

belated_frog_pants@beehaw.org · 14 days ago

This is more like being there when they come to steal and asking them to please ignore some rooms.

Pamasich@kbin.earth · 15 days ago

If they have a brain, and they do have the experience from Threads, they don’t need to scrape Lemmy. They can just set up a shell instance, subscribe to Lemmy communities, and then use federation to get their data for free. That doesn’t involve robots.txt at all.
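To make the federation route concrete, here is a rough sketch of the kind of ActivityPub `Follow` activity a subscribing instance would send to a community. It uses only the generic ActivityStreams vocabulary. The function name and all URLs are hypothetical, Lemmy may expect additional fields, and actual delivery (a signed HTTP POST to the community's inbox) is not shown.

```python
import json


def build_follow_activity(actor_url: str, community_url: str, activity_id: str) -> dict:
    """Build a generic ActivityPub Follow activity (illustrative only)."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "id": activity_id,
        "type": "Follow",
        "actor": actor_url,       # the subscribing user on the shell instance
        "object": community_url,  # the community (group actor) being followed
    }


if __name__ == "__main__":
    # All URLs below are placeholders, not real instances.
    follow = build_follow_activity(
        actor_url="https://shell.example/u/collector",
        community_url="https://lemmy.example/c/technology",
        activity_id="https://shell.example/activities/follow/1",
    )
    print(json.dumps(follow, indent=2))
```

Once such a follow is accepted, new posts and comments are pushed to the follower's inbox by the community's server, so no crawling (and no robots.txt lookup) is involved.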