Frequently asked questions

Why should we block these crawlers?

They're extractive, confer no benefit on the creators of the data they ingest, and have wide-ranging negative externalities, particularly copyright abuse and environmental impact.

How Tech Giants Cut Corners to Harvest Data for A.I.

OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.

How AI copyright lawsuits could make the whole industry go extinct

The New York Times' lawsuit against OpenAI is part of a broader, industry-shaking copyright challenge that could define the future of AI.

Reconciling the contrasting narratives on the environmental impact of large language models

Studies have shown that the training of just one LLM can consume as much energy as five cars do across their lifetimes. The water footprint of AI is also substantial; for example, recent work has highlighted that water consumption associated with AI models involves data centers using millions of gallons of water per day for cooling. Additionally, the energy consumption and carbon emissions of AI are projected to grow quickly in the coming years [...].

Scientists Predict AI to Generate Millions of Tons of E-Waste

we could end up with between 1.2 million and 5 million metric tons of additional electronic waste by the end of this decade [the 2020s].

How do we know AI companies/bots respect robots.txt?

The short answer is that we don't. robots.txt is a well-established standard, but compliance is voluntary. There is no enforcement mechanism.
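To make "voluntary" concrete, here is a minimal sketch of the check a well-behaved crawler runs against robots.txt before fetching a page. It uses Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper and the example URL are assumptions made for illustration. Nothing on the server side runs this check, so a crawler that skips it simply fetches the page anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler performs this check itself and honours the answer.
    print(is_allowed("GPTBot", "https://example.com/post"))
    print(is_allowed("SomeBrowser", "https://example.com/post"))
```

The whole decision lives in the client. That is why robots.txt works as a request, not as an access control.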

Why might AI web crawlers respect robots.txt?

Larger and/or reputable companies developing AI models probably wouldn't want to damage their reputation by ignoring robots.txt.

Also, given the contentious nature of AI and the possibility of legislation limiting its development, companies developing AI models will probably want to be seen to be behaving ethically, and so should (eventually) respect robots.txt.

Can we block crawlers based on user agent strings?

Yes, provided the crawlers identify themselves and your application/hosting supports doing so.

Some crawlers, such as Perplexity, do not identify themselves via their user agent strings and, as such, are difficult to block.
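As a rough illustration of user agent blocking at the application level, the sketch below wraps a Python WSGI app and answers 403 when the `User-Agent` header contains a blocked token. The names `BLOCKED_AGENT_TOKENS`, `is_blocked` and `block_ai_crawlers` are hypothetical, and the two tokens are only examples. It assumes your application is served over WSGI. It cannot catch a crawler that sends a misleading user agent.

```python
from typing import Callable, Iterable

# Hypothetical blocklist: substitute the user agent tokens you want to refuse.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot")


def is_blocked(user_agent: str, tokens: Iterable[str] = BLOCKED_AGENT_TOKENS) -> bool:
    """Case-insensitive substring match of any blocked token in the user agent."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


def block_ai_crawlers(app: Callable) -> Callable:
    """Wrap a WSGI app so that blocked user agents receive 403 Forbidden."""

    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Usage would look like `application = block_ai_crawlers(application)`. The same matching logic is usually cheaper to apply at the web server or CDN, where your hosting supports it.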

What can we do if a bot doesn't respect robots.txt?

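That depends on your stack. Because robots.txt is only advisory, any enforcement has to happen wherever requests are actually served or filtered, such as the web server, reverse proxy, CDN or firewall.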

How can I contribute?

Open a pull request. It will be reviewed and acted upon appropriately. We really appreciate contributions; this is a community effort.