AI has changed the web forever.
Large language models (LLMs) are changing how information is produced, shared and consumed on the web. In fact, estimates suggest that more than half of all web traffic now consists of bots, with a sizable share of that traffic coming from so-called bad bots: bots carrying out malicious or unwanted activity.
For developers, this presents new kinds of challenges. In the past, the threat of content theft came only from humans. Now, developers must ensure their content (be it code, video, tutorials or otherwise) isn't silently indexed, ingested, repurposed, or even monetised by AI bots without their consent.
In this article, we’ll show you how to find out what LLMs know about you, how to block bots from crawling your content, and how to monetise it by charging bots each time they crawl.
1. The Manual Way
To start off, let’s do things the manual way. This approach is by no means scientific, but if you ask your chosen LLM, “Without using search, relying only on indexed data, what do you know about [me / my website]?” or “What can you tell me about GitHub user [your username]?” you’ll get a rough sense of whether your content has found its way into the model’s training data.
At this point, it’s worth saying that this approach won’t give you a dataset, but it’s a quick way to detect whether your content is already circulating inside LLM responses.
Also, there’s a strong possibility the LLM will hallucinate or simply fabricate parts of its response, so take what it says with a generous pinch of salt.
2. Check If Your Code Is In Training Data
The Stack is a dataset of source code compiled by the BigCode project and widely used in the training of LLMs, and Hugging Face has a handy resource called Am I in The Stack? that lets you check if your GitHub repositories (and the code within them) are included in it.
Just search for your GitHub username, and if you find your code listed, you can file an opt-out request for future versions. It’s not perfect, but this is one of the few transparent ways to verify whether your open‑source contributions are part of the AI training ecosystem.
3. Inspect Logs to Detect AI Crawlers
If you want more concrete evidence of AI crawling, your site’s access logs are a powerful resource, showing each request to your site, regardless of the source.
Of course, the simplest way to check is to manually review these logs, but there are plenty of dedicated tools that make the job easier.
Screaming Frog’s Log File Analyser lets you upload log files and quickly visualise which bots are visiting, how often, and which pages they’re targeting. Alternatively, GoAccess provides an open-source, real‑time dashboard built from your logs, highlighting unusual spikes or suspicious crawlers.
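If you’d rather roll your own, the idea is straightforward to sketch. The snippet below scans log lines in the common nginx/Apache “combined” format for well-known AI crawler user agents; the bot list is illustrative (check each vendor’s documentation for current user-agent strings) and the log format is an assumption, so adjust both for your setup:

```python
import re
from collections import Counter

# Substrings that identify well-known AI crawlers in User-Agent headers.
# (Illustrative list; vendors publish and occasionally change these.)
AI_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

# Matches the nginx/Apache "combined" log format: we only need the
# request path and the quoted User-Agent at the end of each line.
LOG_LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*".*"(?P<ua>[^"]*)"$')

def ai_crawler_hits(lines):
    """Count (bot, path) pairs for requests made by known AI crawlers."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m:
            continue
        for bot in AI_BOTS:
            if bot in m.group("ua"):
                hits[(bot, m.group("path"))] += 1
    return hits
```

Feed it your access log (e.g. `ai_crawler_hits(open("access.log"))`) and the counter tells you which bots are hammering which pages, and how often.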
4. Use robots.txt to Set Boundaries
Once you know which AI crawlers are indexing your site (and where), you can establish clearer boundaries.
Updating your robots.txt file allows you, at least in theory, to block or allow specific crawlers using directives like:
User-agent: GPTBot
Disallow: /
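The same mechanism scales to multiple crawlers. A fuller robots.txt might look something like the following (the user-agent tokens shown are the published ones at the time of writing, but they vary by vendor and can change, so check each crawler’s documentation):

```
# Block OpenAI's GPTBot entirely
User-agent: GPTBot
Disallow: /

# Block Common Crawl's bot, whose corpus feeds many LLMs
User-agent: CCBot
Disallow: /

# Opt out of Google's AI training without affecting Search indexing
User-agent: Google-Extended
Disallow: /

# Everyone else may crawl freely
User-agent: *
Allow: /
```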
In addition, you can apply meta tags such as <meta name="robots" content="noindex"> on sensitive pages to prevent them from being indexed.
Unfortunately, some rogue crawlers may ignore these signals, but the major players are under increasing pressure to respect them, especially as scrutiny grows around how LLMs collect training data.
5. Monetise or Restrict Access
AI bots have been silently crawling (and monetising) existing content for the past few years, but the tide might be about to turn with Cloudflare’s introduction of its pay-per-crawl program.
Pay-per-crawl gives developers an enforceable way to control or monetise AI access: crawlers that attempt to scrape sites served through Cloudflare can be met with an HTTP 402 “Payment Required” response, forcing them to either comply with your terms or back off.
Essentially, rather than AI silently crawling and extracting your content, you have a choice over whether to block it, allow it, or monetise access.
For developers, this marks a significant shift: rather than being sidelined while your content fuels LLMs, you can finally establish real terms of engagement.
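Cloudflare enforces this at its edge, but the underlying idea is simple enough to sketch. The snippet below is a toy, self-hosted approximation of the 402 pattern, not Cloudflare’s actual protocol: requests from known AI user agents get “Payment Required”, while everyone else gets the content (the bot list is illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative list of AI crawler user-agent substrings.
AI_BOTS = ("GPTBot", "CCBot", "ClaudeBot")

def status_for(user_agent: str) -> int:
    """402 'Payment Required' for known AI crawlers, 200 for everyone else."""
    return 402 if any(bot in user_agent for bot in AI_BOTS) else 200

class PaywallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = status_for(self.headers.get("User-Agent", ""))
        body = (b"Payment required to crawl this content.\n"
                if status == 402 else b"Hello, human reader!\n")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

# To try it locally (blocks until interrupted):
#   HTTPServer(("", 8000), PaywallHandler).serve_forever()
```

In practice a real paywall would also need to authenticate crawlers and settle payment, which is exactly the machinery a managed program like Cloudflare’s provides.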
Summing Up
Relatively speaking, we’re in the early days of the AI age. It’s transformed our lives in numerous ways, including how we surf the web, but the days of LLMs scraping and ingesting content, unimpeded, appear to be coming to an end (or at least we can hope!).
We hope this article gives you some insights into how your content is being used, and empowers you to have better control over it, too.