LLM Crawlers Are the New Bot Problem (And How to Detect Them)

In an earlier post, we wrote about how 51% of web traffic is now non-human — and how AI bots and agentic browsers are corrupting analytics metrics that were never built to handle them. Since then, the problem has evolved faster than most teams anticipated.

The first wave of AI traffic — the scrapers training large language models — is now a known quantity. But a second wave has arrived: LLM-powered crawlers that operate on behalf of real users in real time. These aren't just scrapers. They're question-answering engines, research agents, and AI browsing tools that visit your site, read your content, and report back — without your analytics seeing any of it.

"Your bot filters know what Googlebot looks like. They have no idea what GPT-4o web search looks like."

That gap is growing. And the sites that understand LLM crawlers — what they are, how they behave, and how to detect them — will have a structural advantage in the AI-native web that's rapidly becoming the default.

What LLM Crawlers Actually Are

The term "LLM crawler" covers several distinct types of non-human visitors that have emerged as AI assistants have gone mainstream. They're related but meaningfully different from each other — and from the traditional bots your current filters are designed to catch.

1. Training Data Scrapers (The First Wave)

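These crawl the web systematically to collect text for training or fine-tuning AI models. GPTBot (OpenAI), ClaudeBot (Anthropic), and CCBot (Common Crawl) are the most visible, as they declare themselves through user-agent strings and respect robots.txt when asked. Google-Extended belongs in the same group, although it is a robots.txt control token honoured by Google's existing crawlers rather than a separate user-agent string.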

They're relatively easy to detect and block. Many sites have already updated their robots.txt to exclude them. The problem: they represent only a fraction of AI traffic.

2. Real-Time Retrieval Crawlers (The Second Wave)

These visit your site on demand, in real time, to answer specific user questions. When someone asks Perplexity "What are WysLeap's pricing plans?" or asks ChatGPT to "summarise the latest articles from [your blog]," a crawler fetches your page, processes it, and returns an answer to the user — who never visits your site in their browser.

Key characteristic: these crawlers are triggered by user intent, visit specific pages, and fetch only what they need. Their traffic pattern looks nothing like systematic scrapers — which is exactly why they evade traditional bot filters.

3. Agentic Browsers (The Third Wave)

These are AI systems that browse the web on behalf of users to complete tasks. Unlike retrieval crawlers that just fetch pages, agentic browsers can navigate, click, fill forms, and interact with dynamic content. Tools like Perplexity Comet, browser-use, Playwright-based AI agents, and increasingly, operating-system-level AI assistants fall into this category.

They're the hardest to detect because they often execute JavaScript, render pages fully, and exhibit behaviour patterns that overlap with legitimate human browsing.

The Scale Is Already Significant

Analysis of server logs across publisher sites in early 2026 shows LLM-related crawl requests running at 15–25% of total traffic on content-heavy sites. For SaaS product sites with detailed feature documentation and pricing pages, the share is even higher.

Almost none of this is captured in standard analytics dashboards. It's not a niche edge case — it's a structural gap in how the web is currently measured.

Why Your Existing Bot Filters Can't Catch Them

Traditional bot detection was designed for a different threat model. Understanding why it fails on LLM crawlers requires understanding exactly what it was built to catch.

User-Agent Blocklists Don't Scale Anymore

Traditional bot filters maintain lists of known bad user-agent strings. This works when there are dozens of known scrapers, each with a consistent signature. It doesn't work when there are thousands of AI frameworks, many of which rotate user agents, spoof browser signatures, or simply use legitimate browser user-agent strings.

GPTBot declares itself. So do ClaudeBot and CCBot. But a custom agent built on Playwright or Puppeteer may present as Chrome 131 on macOS with no distinguishing characteristics at all. Blocklists catch the honest actors while missing the ones that matter.
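
To make the limitation concrete, here is a minimal sketch of a user-agent blocklist check. The token list and the `matchDeclaredCrawler` helper are illustrative assumptions rather than a complete registry, and the sample user-agent strings are abbreviated.

```javascript
// Illustrative list of self-declared crawler tokens (assumption: not exhaustive).
const DECLARED_CRAWLER_TOKENS = ['GPTBot', 'ClaudeBot', 'CCBot', 'PerplexityBot'];

// Returns the matching token, or null when the user agent declares nothing.
function matchDeclaredCrawler(userAgent) {
  const ua = (userAgent || '').toLowerCase();
  const hit = DECLARED_CRAWLER_TOKENS.find((t) => ua.includes(t.toLowerCase()));
  return hit || null;
}

// An honest crawler is caught...
const declared = matchDeclaredCrawler('Mozilla/5.0 (compatible; GPTBot/1.0)');

// ...but an automation script presenting a stock Chrome string is not.
const spoofed = matchDeclaredCrawler(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
);

console.log(declared, spoofed);
```

The second call returns nothing useful because there is no token to match, which is the whole problem with list-based filtering.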

Client-Side Analytics Scripts Don't Run

The majority of analytics platforms — including Google Analytics 4 — are client-side. They rely on a JavaScript tag executing in the visitor's browser. When a retrieval crawler fetches a page without executing JavaScript (which many do, by design, for efficiency), the analytics tag never fires. The visit is completely unrecorded.

This is the invisibility problem: not that these crawlers are hard to detect, but that the detection mechanism itself doesn't run when they arrive.

Behavioural Heuristics Were Trained on Old Patterns

Traditional bot detection looks for patterns such as: visits too many pages too fast, stays zero seconds, comes from suspicious IP ranges, doesn't accept cookies, has no mouse movement. LLM retrieval crawlers set off none of these alarms. They visit specific pages, spend a normal amount of "time" fetching them, come from mainstream cloud provider IP ranges rather than known-bad ones, and navigate sensible paths.

They look, to old detection models, like a careful human browsing from a cloud server. Which is, in a sense, exactly what they are.

The Robots.txt Problem

Robots.txt is the web's social contract: crawlers that respect it identify themselves and follow your access rules. The problem is that compliance is voluntary. OpenAI, Anthropic, and a handful of large players respect robots.txt. The ecosystem of third-party AI tools, open-source agent frameworks, and enterprise AI platforms now building on top of these models largely does not.

You can opt out of GPTBot. You cannot opt out of a developer's custom AI agent built on GPT-4o that browses using a Chrome user-agent.

How to Actually Detect LLM Crawlers

Detection requires combining multiple signals that, individually, are weak — but together, create a reliable fingerprint. Here's the layered approach that works in practice.

Layer 1: Server-Side Request Analysis

This is where detection has to start — before client-side JavaScript ever runs. Examine raw HTTP request headers for signals that correlate with automated fetching:

Accept-Language header: Bots frequently send Accept-Language: * or a minimal set. Humans have specific language preferences.
Missing or minimal Sec-Fetch headers: Modern browsers send a rich set of Sec-Fetch-* headers. Many crawlers omit these entirely.
No Accept-Encoding diversity: Browsers accept multiple compression schemes. Minimal bots often send only Accept-Encoding: gzip.
IP range analysis: Cloud provider IP ranges (AWS, GCP, Azure) are overrepresented in LLM crawler traffic versus human traffic.
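
The four header signals above could be combined into a simple score, roughly as follows. This is a sketch under stated assumptions: `scoreRequestHeaders` is a hypothetical helper, the equal weighting is arbitrary, and the IP prefixes are reserved documentation ranges standing in for the address lists that cloud providers publish.

```javascript
// Placeholder prefixes (documentation ranges), NOT real cloud provider ranges.
const CLOUD_IP_PREFIXES = ['203.0.113.', '198.51.100.'];

// Counts how many automated-fetch signals a request shows (0 to 4).
// `headers` is assumed to use lowercase keys, as Node's http module provides.
function scoreRequestHeaders(headers, remoteIp) {
  const reasons = [];

  const lang = (headers['accept-language'] || '').trim();
  if (lang === '' || lang === '*') reasons.push('accept-language missing or wildcard');

  const hasSecFetch = Object.keys(headers).some((k) => k.startsWith('sec-fetch-'));
  if (!hasSecFetch) reasons.push('no sec-fetch-* headers');

  const encodings = (headers['accept-encoding'] || '')
    .split(',').map((e) => e.trim()).filter(Boolean);
  if (encodings.length <= 1) reasons.push('single or missing accept-encoding');

  if (CLOUD_IP_PREFIXES.some((p) => (remoteIp || '').startsWith(p))) {
    reasons.push('cloud provider ip range');
  }

  return { score: reasons.length, reasons };
}

const browserLike = scoreRequestHeaders({
  'accept-language': 'en-GB,en;q=0.9',
  'accept-encoding': 'gzip, deflate, br',
  'sec-fetch-mode': 'navigate',
}, '192.0.2.10');

const fetcherLike = scoreRequestHeaders(
  { 'accept-language': '*', 'accept-encoding': 'gzip' },
  '203.0.113.7'
);

console.log(browserLike.score, fetcherLike.score);
```

Returning the list of reasons alongside the score keeps the result explainable, which matters when a single weak signal should not be enough to label a visitor.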

Layer 2: JavaScript Execution Probe

The cleanest signal: does JavaScript run? Embed a lightweight, non-blocking beacon in your page that fires on load. Requests that arrive at your server without a corresponding beacon hit within a short window (2–5 seconds) are very likely non-JS-executing crawlers, although visitors who block scripts or leave immediately produce the same pattern.

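// Minimal JS beacon: add to the page layout
fetch('/api/telemetry/js-probe', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  keepalive: true, // let the request complete even if the visitor navigates away
  body: JSON.stringify({ url: location.href, ts: Date.now() })
}).catch(() => {}); // telemetry failures should never surface to the visitor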
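
On the server, the beacon only helps if page requests are matched against beacon hits. A minimal in-memory sketch of that matching follows. The `BeaconCorrelator` class and the per-visit `key` are assumptions: the beacon above sends only the URL and a timestamp, so a real setup would need some identifier (for example one rendered into the page and echoed back by the beacon) to pair the two.

```javascript
// Tracks page requests that have not yet been matched by a JS beacon hit.
class BeaconCorrelator {
  constructor(windowMs = 5000) {
    this.windowMs = windowMs;
    this.pending = new Map(); // key -> timestamp of the page request
  }

  // Call from the page handler. `key` is a hypothetical per-visit identifier.
  recordPageRequest(key, now = Date.now()) {
    this.pending.set(key, now);
  }

  // Call from the beacon endpoint handler.
  recordBeacon(key) {
    return this.pending.delete(key); // true if a page request was waiting
  }

  // Returns keys whose window expired without a beacon, and forgets them.
  sweep(now = Date.now()) {
    const expired = [];
    for (const [key, seenAt] of this.pending) {
      if (now - seenAt > this.windowMs) {
        expired.push(key);
        this.pending.delete(key);
      }
    }
    return expired;
  }
}

const correlator = new BeaconCorrelator(5000);
correlator.recordPageRequest('visit-a', 0);
correlator.recordPageRequest('visit-b', 0);
const matched = correlator.recordBeacon('visit-a'); // JavaScript ran for visit-a
const noJsVisits = correlator.sweep(6000);          // visit-b never sent a beacon
console.log(matched, noJsVisits);
```

Calling `sweep` on a timer gives a running list of visits where JavaScript never executed. A production version would persist this state rather than hold it in process memory.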

Layer 3: Behavioural Fingerprinting

For LLM crawlers that do execute JavaScript, detection shifts to behavioural signals:

No mouse movement or touch events: Humans move their cursor before clicking. Headless browsers and agents typically trigger clicks without prior mouse movement events.
Abnormal timing patterns: Humans scroll, pause, read, scroll again. Crawlers that render pages typically scroll to a fixed depth immediately.
Navigator API anomalies: navigator.webdriver is true for browser automation frameworks unless explicitly patched.
Screen dimension consistency: Agentic browsers often run in default headless viewport sizes (1280x720, 1366x768) with unusual pixel ratios.
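
A few of these checks can be gathered in one place, as in the sketch below. `collectAutomationSignals` is a hypothetical helper, and the interaction counters are assumed to be maintained elsewhere by listeners on `pointermove` and `click`. Each signal is weak on its own: a tap on a touch screen, for instance, can also produce a click with no preceding pointer movement.

```javascript
// Viewport sizes the list above associates with headless defaults.
const HEADLESS_VIEWPORTS = ['1280x720', '1366x768'];

// `env` mirrors browser globals so the function can run outside a page too.
// In a page, pass { navigator, innerWidth, innerHeight }.
function collectAutomationSignals(env, interaction) {
  const signals = [];

  if (env.navigator && env.navigator.webdriver === true) {
    signals.push('webdriver');
  }
  if (HEADLESS_VIEWPORTS.includes(`${env.innerWidth}x${env.innerHeight}`)) {
    signals.push('default-viewport');
  }
  if (interaction.clicks > 0 && interaction.pointerMoves === 0) {
    signals.push('click-without-pointer-movement');
  }
  return signals;
}

const agentLike = collectAutomationSignals(
  { navigator: { webdriver: true }, innerWidth: 1280, innerHeight: 720 },
  { clicks: 2, pointerMoves: 0 }
);

const humanLike = collectAutomationSignals(
  { navigator: { webdriver: false }, innerWidth: 1512, innerHeight: 823 },
  { clicks: 1, pointerMoves: 48 }
);

console.log(agentLike, humanLike);
```

Passing the environment in as an argument, instead of reading globals directly, makes the function easy to exercise in tests with made-up values like the ones shown.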

Layer 4: ML-Based Anomaly Detection

Each signal is weak on its own, and LLM crawlers are getting better at mimicking human behaviour. The strongest detection approach combines all signals into an ML model that:

Learns the normal distribution of your site's human traffic (not a generic model)
Identifies deviations from that distribution as anomaly scores
Updates continuously as new crawler patterns emerge — so detection improves as the threat evolves
Can cluster anomalous traffic into named patterns (e.g., "Perplexity fetch pattern," "headless Chrome agentic," "unknown retrieval crawler")

This is the core of WysLeap's auto-bot-discovery system — a self-learning detection layer that identifies new bot patterns from behavioural anomalies and continuously refines its accuracy through feedback.
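
The idea of learning a site-specific baseline and scoring deviations from it can be illustrated with a deliberately simple statistical stand-in. The sketch below uses per-feature z-scores. It is not WysLeap's model, and the feature names (`secondsOnPage`, `scrollEvents`) and sample values are made up for illustration.

```javascript
// Learns per-feature mean and standard deviation from known-human sessions.
function fitBaseline(sessions) {
  const baseline = {};
  for (const feature of Object.keys(sessions[0])) {
    const values = sessions.map((s) => s[feature]);
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance =
      values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
    baseline[feature] = { mean, std: Math.sqrt(variance) || 1 }; // avoid divide-by-zero
  }
  return baseline;
}

// Anomaly score = largest absolute z-score across all baseline features.
function anomalyScore(baseline, session) {
  return Math.max(
    ...Object.keys(baseline).map(
      (f) => Math.abs(session[f] - baseline[f].mean) / baseline[f].std
    )
  );
}

// Made-up sample of human sessions for one site.
const humanSessions = [
  { secondsOnPage: 40, scrollEvents: 12 },
  { secondsOnPage: 55, scrollEvents: 18 },
  { secondsOnPage: 35, scrollEvents: 9 },
  { secondsOnPage: 50, scrollEvents: 15 },
];

const baseline = fitBaseline(humanSessions);
const typicalScore = anomalyScore(baseline, { secondsOnPage: 45, scrollEvents: 14 });
const crawlerLikeScore = anomalyScore(baseline, { secondsOnPage: 1, scrollEvents: 0 });

console.log(typicalScore.toFixed(2), crawlerLikeScore.toFixed(2));
```

Because the baseline is fitted to one site's own traffic, the same session can score as ordinary on one site and anomalous on another, which is the point of not using a generic model.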

What to Do Once You Can See Them

Detection is only the first step. What you do with LLM crawler traffic depends on what type it is and what your goals are. The right response is not always "block."

For Training Scrapers: Robots.txt + Licensing

If your primary concern is your content being used to train AI models without compensation, robots.txt directives for known crawlers (GPTBot, ClaudeBot, CCBot) are the first line of defence. Beyond that, content licensing discussions with AI companies are becoming increasingly viable — and knowing exactly which crawlers are accessing your content and how frequently puts you in a much stronger negotiating position.
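
For reference, the robots.txt rules in question are short. The sketch below generates them from a list of crawler tokens. `buildRobotsTxt` is a hypothetical helper, and the tokens are the ones named above.

```javascript
// Builds robots.txt rules that disallow the given crawler tokens site-wide.
function buildRobotsTxt(blockedAgents) {
  const blocks = blockedAgents.map(
    (agent) => `User-agent: ${agent}\nDisallow: /`
  );
  // Everyone else keeps full access (an empty Disallow allows everything).
  blocks.push('User-agent: *\nDisallow:');
  return blocks.join('\n\n') + '\n';
}

const robotsTxt = buildRobotsTxt(['GPTBot', 'ClaudeBot', 'CCBot']);
console.log(robotsTxt);
```

As discussed earlier, these rules only bind crawlers that choose to honour them.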

For Real-Time Retrieval: Measure, Then Decide

Before blocking real-time retrieval crawlers, consider what they're actually doing. If Perplexity is fetching your pricing page ten times a day in response to user queries, those queries represent genuine commercial intent from humans who could become customers. Blocking the crawler may redirect that intent to a competitor's pricing page instead.

The more strategic move: optimise your content for AI retrieval, ensure your most important information is in the HTML (not behind JavaScript rendering), and monitor which pages get fetched most. Those pages are surfacing in AI answers — they deserve your attention.
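
One quick way to check whether key information is present without JavaScript is to search the raw server response for it. The sketch below does this with regular expressions, which is crude compared with a real HTML parser but enough for a spot check. `missingFromStaticHtml`, the sample markup, and the plan names are all made up for illustration.

```javascript
// Returns the phrases that do NOT appear in the server-rendered HTML text.
function missingFromStaticHtml(html, phrases) {
  // Drop script/style bodies and remaining tags so only visible text is searched.
  const text = html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .toLowerCase();
  return phrases.filter((p) => !text.includes(p.toLowerCase()));
}

// Sample response: one plan is in the markup, the other is rendered by script.
const sampleHtml = `
  <html><body>
    <h1>Pricing</h1>
    <p>Starter plan: free</p>
    <div id="plans"></div>
    <script>renderPlans({ name: "Team plan" });</script>
  </body></html>`;

const missing = missingFromStaticHtml(sampleHtml, ['Starter plan', 'Team plan']);
console.log(missing);
```

Anything this reports as missing is content that a crawler which does not execute JavaScript will not see.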

For Agentic Browsers: Intent-Based Classification

As noted in our earlier post, 87% of AI agent page visits are product-related, meaning they represent a human with genuine purchase intent who has delegated research to an AI assistant. Blanket blocking loses those leads.

A better approach: classify agentic traffic separately, exclude it from your human-audience metrics (so your analytics remain clean), but don't block it from accessing your content. Treat agentic visits as an engagement signal — if an AI agent is visiting your pricing page on someone's behalf, that someone is worth understanding.

The Metrics to Track

LLM crawler share of total traffic — trending up is expected, but sudden spikes may indicate targeted scraping
Top pages by LLM fetch volume — tells you which content is being surfaced in AI responses
Crawler type breakdown — training scrapers vs. real-time retrieval vs. agentic, treated differently
Human-only traffic trend — your actual audience, separated from machine traffic
New bot pattern emergence — novel traffic signatures you haven't seen before
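
Most of these metrics fall out of a single pass over classified traffic records. The sketch below assumes each record has already been labelled with a `type` of `human`, `training`, `retrieval`, `agentic`, or `unknown`. The record shape and `summarizeTraffic` are hypothetical, and `unknown` traffic is counted as non-human here.

```javascript
// Summarises classified traffic records into the metrics listed above.
function summarizeTraffic(records) {
  const byType = {};
  const nonHumanFetchesByPage = {};

  for (const { type, path } of records) {
    byType[type] = (byType[type] || 0) + 1;
    if (type !== 'human') {
      nonHumanFetchesByPage[path] = (nonHumanFetchesByPage[path] || 0) + 1;
    }
  }

  const total = records.length;
  const humanOnly = byType.human || 0;
  const topPages = Object.entries(nonHumanFetchesByPage)
    .sort((a, b) => b[1] - a[1]) // most-fetched pages first
    .map(([path, count]) => ({ path, count }));

  return {
    total,
    humanOnly,
    crawlerShare: total === 0 ? 0 : (total - humanOnly) / total,
    byType,
    topPages,
  };
}

const summary = summarizeTraffic([
  { type: 'human', path: '/' },
  { type: 'human', path: '/pricing' },
  { type: 'retrieval', path: '/pricing' },
  { type: 'retrieval', path: '/pricing' },
  { type: 'agentic', path: '/docs' },
]);

console.log(summary.crawlerShare, summary.topPages[0]);
```

Running this per day and comparing `crawlerShare` across days is one way to spot the sudden spikes mentioned in the first metric.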

Why Static Lists Will Always Lose — and What Wins Instead

The LLM crawler landscape changes every few months. New AI tools launch. Existing ones update their user agents. Enterprise customers build internal agents on top of public APIs. Open-source frameworks proliferate.

Any detection approach based on a static blocklist is already behind. By the time you add a new crawler signature to your filter, it's been joined by a dozen variants you haven't seen yet.

The Only Sustainable Approach: Behavioural ML That Adapts

A detection system that learns what your human traffic looks like, continuously updates that model, and identifies deviations — without needing to know the name of the crawler producing them — is fundamentally more durable than any static approach.

This is why WysLeap's bot detection is built on adaptive ML rather than blocklists. When a new crawler pattern appears — even one we've never seen before — the anomaly detection flags it, the pattern is learned, and the model improves. No manual update required. No lag between new threat and detection.

The Three-Layer Architecture That Works

Server-side collection: capture all traffic before any JavaScript runs, so nothing is invisible by default
Multi-signal classification: combine header analysis, JS execution probes, and behavioural fingerprinting to classify each visitor
Self-learning anomaly detection: continuously identify new patterns, update the model, and improve accuracy without manual intervention

The result is a system that gets better at detecting LLM crawlers over time, rather than worse, as blocklists do when they go stale.
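
The multi-signal classification step in that architecture can be sketched as a small decision function. Everything here is an assumption made for illustration: the input shape, the label names, and the thresholds, which are not tuned values.

```javascript
// Combines three kinds of evidence into a single label for one visit.
// Thresholds are illustrative assumptions, not tuned values.
function classifyVisit({ declaredCrawler, headerScore, jsExecuted, automationSignals }) {
  // A self-declared user agent settles the question immediately.
  if (declaredCrawler) return 'declared-crawler';

  // No JavaScript: rely on server-side header evidence alone.
  if (!jsExecuted) {
    return headerScore >= 2 ? 'retrieval-crawler' : 'unknown-no-js';
  }

  // JavaScript ran: look at behavioural evidence from the browser.
  if (automationSignals.length >= 2) return 'agentic-browser';
  return 'likely-human';
}

const retrievalLabel = classifyVisit({
  declaredCrawler: null, headerScore: 3, jsExecuted: false, automationSignals: [],
});
const agenticLabel = classifyVisit({
  declaredCrawler: null, headerScore: 0, jsExecuted: true,
  automationSignals: ['webdriver', 'default-viewport'],
});
const humanLabel = classifyVisit({
  declaredCrawler: null, headerScore: 0, jsExecuted: true, automationSignals: [],
});

console.log(retrievalLabel, agenticLabel, humanLabel);
```

Fixed rules like these are only a starting point. The self-learning layer described above exists precisely because hand-set thresholds go stale in the same way blocklists do.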

The Taxonomy Cheat Sheet

Crawler Type	Detectable by UA?	Runs JS?	Risk to Metrics	Recommended Action
Training scrapers (GPTBot, CCBot)	Often yes	No	Content theft	Block via robots.txt
Real-time retrieval (Perplexity, Bing AI)	Sometimes	Rarely	Invisible in analytics	Measure, then decide; optimise content for AI
Custom agents (Playwright, browser-use)	Rarely	Sometimes	Corrupts metrics	ML behavioural detection required
Agentic browsers (Comet, Gemini)	No	Yes	Inflates engagement	Classify separately; don't block

See Every Bot That's Hitting Your Site

WysLeap's self-learning bot detection automatically discovers new patterns — including LLM crawlers your current filters have never seen. Get clean human-only metrics alongside full bot visibility, with zero manual configuration.


Siva J.P.

Privacy Research Lead at WysLeap

February 8, 2026
