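How can you tell whether AI crawlers have visited your site?

Analyze the server access log and filter by User-Agent, request path, status code, IP, Referer, and request frequency to find records from crawlers such as GPTBot, OAI-SearchBot, ChatGPT-User, Googlebot, Bingbot, Bytespider, ClaudeBot, and PerplexityBot.

Why does GEO care whether sitemap.xml and llms.txt are fetched?

sitemap.xml helps crawlers discover key pages, and llms.txt provides a machine-readable description of the site. If AI crawlers only request the home page and never reach entry points such as official, geo, faq, or llms.txt, the AI-readable paths are probably not clear enough yet.

Is the User-Agent alone enough to confirm an official AI crawler?

No. A User-Agent can be forged. A more reliable approach combines reverse DNS, forward DNS, official IP ranges, request frequency, and request paths.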

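RICHTREES Insights · Technical Research

How to Use Server Logs to Tell Whether AI Crawlers Have Visited Your Site

GEO (generative engine optimization) cannot rest on a feeling that "AI seems to have indexed us." A more reliable method is to analyze the server access log directly: which AI crawlers came, which paths they requested, whether the status codes were healthy, whether they only hit the home page, and whether they read sitemap.xml and llms.txt. This article covers quick triage with Bash, batch analysis with Python/Pandas, rDNS verification of real versus fake crawlers, Nginx rate limiting, and llms.txt in practice, and combines them into a workable AI crawler monitoring setup.

Published: 2026-06-22 | Column: In-depth Research and Insights | Topic: AI Crawler Monitoring / GEO Technical Practice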

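1. Why GEO needs server logs
2. Quick triage of AI crawler visits with Bash
3. Parsing logs with Python/Pandas and charting visit frequency
4. Hardcore tip: don't trust the User-Agent alone, verify with rDNS
5. Nginx: a dedicated log and rate limit for AI crawlers
6. Key path analysis: are crawlers reaching the "AI-readable entry points"?
7. A closer look at llms.txt: why Markdown suits LLMs
8. Three kinds of follow-up after log analysis
References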

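1. Why GEO needs server logs

If a company wants to know whether AI search systems are reading its website, one of the most direct methods is to look at the server access log. A log entry usually contains the request time, IP, request path, status code, User-Agent, Referer, and response body size.

For GEO, server logs can answer at least four questions:

Whether AI crawlers have visited the site.
Whether they requested only the home page, or also the official verification page, service pages, FAQ, sitemap, and llms.txt.
Whether the requests succeeded, or returned 403, 404, or 500.
Whether the User-Agent is trustworthy, or possibly a forged crawler.

Common log paths: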

/var/log/nginx/access.log
/var/log/nginx/error.log
/var/log/apache2/access.log

This article assumes Nginx uses the common combined log layout, declared here under the name main:

log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent"';

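The overall pipeline can be summarized in the following diagram:

graph TD
    A["AI crawler / fake crawler"] -->|"sends HTTP request"| B("Nginx ingress layer")
    B -->|"403 / rate limiting"| C{"Forged or over the rate limit?"}
    C -->|"Yes"| D["Block request / write to error log"]
    C -->|"No"| E["Normal response / write to access log"]
    E --> F["Periodic cleanup by Python / Pandas script"]
    D --> F
    F --> G(("Generate GEO monitoring report"))
    G -->|"guides optimization"| H["Fix 404s / add internal links / adjust llms.txt"]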

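2. Quick triage of AI crawler visits with Bash

Start by using grep to search for the User-Agent strings of common AI and search crawlers:

grep -iE "GPTBot|OAI-SearchBot|ChatGPT-User|Googlebot|Bingbot|Bytespider|Baiduspider|ClaudeBot|PerplexityBot" /var/log/nginx/access.log

Count requests per crawler (with the combined format, the sixth quote-delimited field is the User-Agent):

awk -F\" '{print $6}' /var/log/nginx/access.log \
  | grep -iE "bot|spider|crawler|gpt|chatgpt|claude|perplexity|bytespider" \
  | sort | uniq -c | sort -nr

List the pages a specific crawler requested:

grep -i "OAI-SearchBot" /var/log/nginx/access.log \
  | awk '{print $7}' | sort | uniq -c | sort -nr

Show the status code distribution for AI crawlers:

grep -iE "GPTBot|OAI-SearchBot|ChatGPT-User|Bytespider|ClaudeBot|PerplexityBot" /var/log/nginx/access.log \
  | awk '{print $9}' | sort | uniq -c | sort -nr

A large share of 403, 404, or 500 responses means the AI crawlers did arrive but could not read the content. In that case the first GEO step is not to publish more articles but to repair the access path.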

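3. Parsing logs with Python/Pandas and charting visit frequency

Bash is good for quick triage; Pandas is better for recurring reports. The script below parses an Nginx access log into a DataFrame and counts AI crawler requests.

Install the dependencies:

pip install pandas matplotlib

Example script: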

import re
import pandas as pd
import matplotlib.pyplot as plt

LOG_FILE = "/var/log/nginx/access.log"

AI_BOT_PATTERN = re.compile(
    r"GPTBot|OAI-SearchBot|ChatGPT-User|Googlebot|Bingbot|Bytespider|"
    r"Baiduspider|ClaudeBot|PerplexityBot",
    re.I
)

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line: str):
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

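# Read the log line by line; lines that do not match the pattern are skipped.
rows = []
with open(LOG_FILE, "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        item = parse_line(line)
        if item:
            rows.append(item)

df = pd.DataFrame(rows)
if df.empty:
    raise SystemExit("No parsable lines found; check LOG_FILE and the log format.")

# Flag crawler requests and extract a normalized (lowercase) crawler name.
df["is_ai_bot"] = df["ua"].str.contains(AI_BOT_PATTERN, na=False)
df["bot_name"] = df["ua"].str.extract(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|Googlebot|Bingbot|Bytespider|Baiduspider|ClaudeBot|PerplexityBot)",
    flags=re.I,
    expand=False
).str.lower()

ai_df = df[df["is_ai_bot"]].copy()

print("Total AI crawler requests:", len(ai_df))
print("\nRequests per crawler:")
print(ai_df["bot_name"].value_counts())

print("\nTop 20 requested paths:")
print(ai_df["path"].value_counts().head(20))

print("\nStatus codes per crawler:")
print(pd.crosstab(ai_df["bot_name"], ai_df["status"]))

# Horizontal bar chart of requests per crawler; skipped when there is nothing to plot.
freq = ai_df["bot_name"].value_counts().sort_values(ascending=True)
if freq.empty:
    print("\nNo AI crawler requests found; chart not generated.")
else:
    ax = freq.plot(kind="barh", figsize=(10, 5), title="AI crawler visit frequency")
    ax.set_xlabel("Visits")
    plt.tight_layout()
    plt.savefig("ai_bot_visits.png", dpi=160)
    print("\nChart saved: ai_bot_visits.png")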

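The script can be extended directly into a weekly report:

bot_name: which crawlers came.
path: which pages they requested.
status: whether the requests succeeded.
ai_bot_visits.png: the crawler visit frequency chart.

For GEO monitoring, a weekly report covering the following is a reasonable baseline:

Number of AI crawler requests
Number of requests to core pages
List of abnormal status codes
Key pages that were not visited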
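
The four weekly metrics above can be sketched with the standard library alone. This is a minimal illustration, not part of the original script: it assumes rows have already been filtered to AI crawlers and carry path and status keys like the parser output, and KEY_PATHS is a hypothetical list you would replace with your own core pages.

```python
from collections import Counter

# Hypothetical list of pages the site wants AI crawlers to reach.
KEY_PATHS = ["/official/", "/geo/", "/faq/", "/llms.txt", "/sitemap.xml"]


def weekly_summary(rows: list[dict], key_paths: list[str]) -> dict:
    """Aggregate AI crawler rows into the four weekly metrics."""
    # Drop query strings so /faq/?from=home counts as /faq/.
    path_counts = Counter(row["path"].split("?", 1)[0] for row in rows)
    errors = [(row["path"], row["status"]) for row in rows if row["status"] >= 400]
    return {
        "total_visits": len(rows),
        "key_page_visits": {p: path_counts.get(p, 0) for p in key_paths},
        "error_requests": errors,
        "unvisited_key_pages": [p for p in key_paths if path_counts.get(p, 0) == 0],
    }


# Made-up sample rows in the shape produced by a log parser.
sample_rows = [
    {"path": "/", "status": 200},
    {"path": "/llms.txt", "status": 200},
    {"path": "/faq/?from=home", "status": 200},
    {"path": "/old-page", "status": 404},
]
summary = weekly_summary(sample_rows, KEY_PATHS)
print(summary["unvisited_key_pages"])
```

The returned dictionary maps one-to-one onto the report items, so it can be fed into whatever message or chart format the team prefers.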

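3.1 Automated alerts: pushing the weekly report to Feishu, WeCom, or DingTalk

In real GEO engineering work, operations staff are usually not expected to log in to the server every day to look at charts. A more common setup is to run the log analysis script as a Linux crontab job and combine it with a Feishu, WeCom (企业微信), or DingTalk webhook bot, so that the AI crawler report is pushed to the team chat automatically every Monday morning.

A generic webhook push template: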

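import json
import os
import urllib.request

# Read the webhook address from the environment instead of hard-coding it.
# GEO_WEBHOOK_URL is an example variable name; the fallback is only a placeholder.
WEBHOOK_URL = os.environ.get(
    "GEO_WEBHOOK_URL",
    "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxx",
)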

def post_json(url: str, payload: dict):
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")

def build_report(total: int, top_paths: list[tuple[str, int]], error_count: int):
    lines = [
        "GEO AI Crawler Monitoring Weekly Report",
        f"- Total AI crawler requests: {total}",
        f"- Requests with abnormal status codes: {error_count}",
        "- Top 5 requested paths:",
    ]
    for path, count in top_paths[:5]:
        lines.append(f"  - {path}: {count}")
    return "\n".join(lines)

report_text = build_report(
    total=128,
    top_paths=[
        ("/", 40),
        ("/llms.txt", 22),
        ("/official/", 18),
        ("/sitemap.xml", 16),
        ("/geo/", 12),
    ],
    error_count=3,
)

# Feishu text message format
payload = {
    "msg_type": "text",
    "content": {
        "text": report_text
    }
}

print(post_json(WEBHOOK_URL, payload))

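Crontab example, running automatically every Monday at 09:00:

0 9 * * 1 /usr/bin/python3 /opt/geo-monitor/ai_bot_report.py >> /var/log/geo-monitor.log 2>&1

For WeCom or DingTalk the core logic stays the same; only the payload has to follow that platform's message format. As an engineering practice, keep the webhook address in an environment variable rather than hard-coding it in the repository.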
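
As a rough sketch of that advice, the helper below switches the payload shape per platform and loads the webhook address from the environment. The field layouts reflect the text-message formats commonly documented for these platforms' group bots and should be checked against the current official docs; GEO_WEBHOOK_URL and the platform keys are names made up for this example.

```python
import os


def build_payload(platform: str, text: str) -> dict:
    """Return a plain-text message payload for the given chat platform."""
    if platform == "feishu":
        return {"msg_type": "text", "content": {"text": text}}
    if platform in ("wecom", "dingtalk"):
        # Both bots are assumed to accept this shape for plain text.
        return {"msgtype": "text", "text": {"content": text}}
    raise ValueError(f"unsupported platform: {platform}")


def load_webhook_url(env_name: str = "GEO_WEBHOOK_URL") -> str:
    """Read the webhook URL from an environment variable (hypothetical name)."""
    url = os.environ.get(env_name, "")
    if not url.startswith("https://"):
        raise RuntimeError(f"{env_name} is not set to an https URL")
    return url


print(build_payload("wecom", "GEO AI crawler weekly report"))
```

With crontab, the variable can be set at the top of the crontab file or in a wrapper script, so the secret never lands in version control.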

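3.2 Going further: GoAccess and an ELK log dashboard

If the site produces tens to hundreds of MB of logs per day, the Pandas script is enough. At tens of GB per day a single-machine script becomes slow, and two more advanced options are worth considering.

The first is GoAccess, which suits a lightweight real-time dashboard. It reads the Nginx access log directly and quickly generates an HTML report: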

goaccess /var/log/nginx/access.log \
  --log-format=COMBINED \
  -o /var/www/html/goaccess.html

It can also be used for live viewing:

tail -f /var/log/nginx/access.log | goaccess - --log-format=COMBINED

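The second is ELK, that is, Elasticsearch, Logstash, and Kibana. Mid-size and large companies can ship Nginx logs into ELK and build a dedicated AI-Bot-Monitor dashboard with fields such as:

remote_addr: client IP.
request_uri: requested path.
status: status code.
http_user_agent: User-Agent.
bot_name: crawler name extracted from the User-Agent.
is_verified_bot: whether the request passed rDNS verification.
request_time: response time.

Kibana can then slice the data along several dimensions: crawler, path, status code, time trend, abnormal IPs, and core page coverage.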
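
To make the field list concrete, here is a small sketch that turns one parsed log row into a JSON document with those field names, one document per line, which is a shape log shippers can typically ingest. The row keys (ip, path, status, ua) mirror the parser earlier in this article; request_time is only available if the Nginx log format is extended with $request_time, so it is treated as optional here.

```python
import json


def to_monitor_doc(row: dict, bot_name: str, is_verified_bot: bool) -> str:
    """Serialize one parsed access-log row as a JSON line for the dashboard."""
    doc = {
        "remote_addr": row["ip"],
        "request_uri": row["path"],
        "status": row["status"],
        "http_user_agent": row["ua"],
        "bot_name": bot_name,
        "is_verified_bot": is_verified_bot,
        # None unless the log format includes $request_time.
        "request_time": row.get("request_time"),
    }
    return json.dumps(doc, ensure_ascii=False)


# Made-up sample row.
sample = {
    "ip": "192.0.2.10",
    "path": "/llms.txt",
    "status": 200,
    "ua": "Mozilla/5.0 (compatible; GPTBot/1.0)",
}
line = to_monitor_doc(sample, bot_name="gptbot", is_verified_bot=False)
print(line)
```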

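4. Hardcore tip: don't trust the User-Agent alone, verify with rDNS

A User-Agent is easy to forge. An attacker can pose as Googlebot, GPTBot, or Bytespider, so seeing a given User-Agent in the log does not prove the request came from the official crawler.

A more reliable approach is reverse DNS resolution (rDNS):

Run a reverse lookup on the client IP to obtain a hostname.
Check that the hostname ends with an official domain suffix.
Run a forward lookup on the hostname and confirm that the returned IPs include the original client IP.

Python example: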

import socket

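def verify_rdns(ip: str, allowed_suffixes: tuple[str, ...]) -> tuple[bool, str]:
    """Return (verified, detail) using reverse DNS plus a forward DNS check."""
    try:
        # Step 1: reverse lookup, IP -> hostname.
        hostname, _, _ = socket.gethostbyaddr(ip)
        hostname = hostname.lower().rstrip(".")

        # Step 2: the hostname must end with an official domain suffix.
        if not hostname.endswith(allowed_suffixes):
            return False, f"rDNS suffix mismatch: {hostname}"

        # Step 3: the forward lookup must return the original IP.
        # Note: gethostbyname_ex only resolves IPv4 addresses.
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
        if ip not in forward_ips:
            return False, f"forward lookup did not return the original IP: {hostname} -> {forward_ips}"

        return True, hostname
    except OSError as e:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False, str(e)

# Common official suffixes for Googlebot; Google's documentation is authoritative.
print(verify_rdns("66.249.66.1", (".googlebot.com", ".google.com")))

# For OpenAI and other AI crawlers, combine official documentation, IP ranges, and rDNS.
# The suffix below is only a placeholder, not a documented verification rule.
print(verify_rdns("1.2.3.4", (".openai.com",)))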

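Notes:

For Googlebot, the officially recommended check is two-way verification: rDNS plus forward DNS.
For crawlers from OpenAI, ByteDance, Anthropic, Perplexity, and others, follow the verification method in each vendor's official documentation.
Do not allow or block traffic based on the User-Agent alone.
Rate-limit abnormally frequent requests even when the User-Agent looks like an official crawler.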
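
Where a vendor publishes IP ranges instead of (or in addition to) rDNS hostnames, the check becomes a CIDR membership test. The sketch below uses Python's ipaddress module; PLACEHOLDER_RANGES contains reserved documentation ranges, not any vendor's real ranges, and would need to be replaced with the list from the official documentation.

```python
import ipaddress

# Placeholder CIDRs from reserved documentation ranges, NOT real crawler ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR blocks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4/IPv6 literal.
        return False
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only within the same IP version.
        if addr.version == network.version and addr in network:
            return True
    return False


print(ip_in_ranges("192.0.2.55", PLACEHOLDER_RANGES))
print(ip_in_ranges("203.0.113.10", PLACEHOLDER_RANGES))
```

A request whose User-Agent claims to be a crawler but whose IP is outside the published ranges is a candidate for the rate limiting and blocking steps described later, not proof of abuse on its own.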

5. Nginx: a dedicated log and rate limit for AI crawlers

AI crawler requests can be written to a separate log file to simplify later analysis.

http {
    map $http_user_agent $is_ai_crawler {
        default 0;
        ~*(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|PerplexityBot|Bytespider|Bingbot|Googlebot) 1;
    }

    log_format ai_crawler '$remote_addr [$time_local] "$request" '
                          '$status "$http_user_agent" "$http_referer" '
                          'rt=$request_time';

    server {
        listen 80;
        server_name example.com;

        access_log /var/log/nginx/access.log main;
        access_log /var/log/nginx/ai-crawler.log ai_crawler if=$is_ai_crawler;

        location / {
            try_files $uri $uri/ =404;
        }
    }
}

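If you are worried that AI crawlers hit the site too fast, apply gentle rate limiting. The configuration below sets a rate-limit key only for requests whose User-Agent matches an AI crawler. Ordinary users are unaffected, because Nginx does not count requests whose key is empty.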

http {
    map $http_user_agent $ai_limit_key {
        default "";
        ~*(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|PerplexityBot|Bytespider) $binary_remote_addr;
    }

    limit_req_zone $ai_limit_key zone=ai_crawler_zone:10m rate=30r/m;

    server {
        listen 80;
        server_name example.com;

        location / {
            limit_req zone=ai_crawler_zone burst=20 nodelay;
            try_files $uri $uri/ =404;
        }
    }
}

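Configuration advice:

Do not block all AI crawlers outright, or you lose visibility in AI search.
Keep high-value content pages accessible.
Rate-limit abnormally frequent requests.
Explicitly deny sensitive paths such as admin panels, configuration files, backups, .env, and .git.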
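
Before settling on a limit such as 30r/m, it helps to see how fast crawlers actually request pages. The following sketch computes the peak requests per minute for each IP from parsed rows; it assumes each row has ip and time keys, with time in Nginx's $time_local layout, and it relies on English month abbreviations (the default C locale).

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # layout of Nginx $time_local


def peak_requests_per_minute(rows: list[dict]) -> dict[str, int]:
    """Return the busiest single-minute request count observed for each IP."""
    per_minute: Counter = Counter()
    for row in rows:
        ts = datetime.strptime(row["time"], TIME_FORMAT)
        # Truncate to the minute so requests in the same minute share a bucket.
        per_minute[(row["ip"], ts.replace(second=0))] += 1
    peaks: dict[str, int] = {}
    for (ip, _minute), count in per_minute.items():
        peaks[ip] = max(peaks.get(ip, 0), count)
    return peaks


# Made-up sample rows.
sample_rows = [
    {"ip": "203.0.113.10", "time": "22/Jun/2026:09:00:01 +0800"},
    {"ip": "203.0.113.10", "time": "22/Jun/2026:09:00:30 +0800"},
    {"ip": "203.0.113.10", "time": "22/Jun/2026:09:00:59 +0800"},
    {"ip": "203.0.113.10", "time": "22/Jun/2026:09:01:05 +0800"},
    {"ip": "198.51.100.23", "time": "22/Jun/2026:09:00:10 +0800"},
]
peaks = peak_requests_per_minute(sample_rows)
print(peaks)
```

If the observed peaks of verified crawlers sit well below the configured rate, the limit will mostly affect impostors and runaway clients.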

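If you have confirmed that an IP is forging an AI crawler identity and scanning admin panels, configuration files, or vulnerable paths, you can block it directly: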

server {
    listen 80;
    server_name example.com;

    deny 203.0.113.10;
    deny 198.51.100.23;

    location / {
        try_files $uri $uri/ =404;
    }
}

A more maintainable approach is to keep the malicious IP list in a separate file so that scripts can update it automatically:

server {
    listen 80;
    server_name example.com;

    include /etc/nginx/block_ai_abuse.conf;

    location / {
        try_files $uri $uri/ =404;
    }
}

Example /etc/nginx/block_ai_abuse.conf:

deny 203.0.113.10;
deny 198.51.100.23;

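For automated blocking, Fail2ban is an option. The idea: when an IP sends many requests to sensitive paths such as /.env, /.git, or /wp-admin within a short window, and its User-Agent poses as an AI crawler, add it to the ban list automatically.

Fail2ban filter example: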

# /etc/fail2ban/filter.d/nginx-ai-abuse.conf
[Definition]
failregex = ^<HOST> .* "(GET|POST) /(\.env|\.git|wp-admin|config|backup|dump)[^"]*" .* "[^"]*(GPTBot|OAI-SearchBot|Bytespider|ClaudeBot|PerplexityBot)[^"]*"$
ignoreregex =

Fail2ban jail example:

# /etc/fail2ban/jail.d/nginx-ai-abuse.local
[nginx-ai-abuse]
enabled = true
filter = nginx-ai-abuse
logpath = /var/log/nginx/access.log
maxretry = 5
findtime = 600
bantime = 86400
port = http,https
protocol = tcp
banaction = iptables-multiport

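In production, watch the logs before banning anything, so that real search or AI crawlers are not hit by mistake. A safer sequence is: log separately first, then rate-limit, then ban only clearly malicious scanners. If you want Fail2ban to write into the Nginx deny file automatically, you need to write a custom action, and it should run nginx -t && systemctl reload nginx after each write.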
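
One way to "watch first" is a dry run that lists which IPs the rule would catch, without banning anyone. The sketch below mirrors the idea of the filter (sensitive path plus an AI crawler User-Agent) on parsed rows. It is an approximation: it counts over the whole input and ignores the findtime window, and the threshold semantics (flag at max_retry hits) are an assumption about how you would want to read the result.

```python
import re
from collections import Counter

SENSITIVE_PATH = re.compile(r"^/(\.env|\.git|wp-admin|config|backup|dump)")
AI_BOT_UA = re.compile(r"GPTBot|OAI-SearchBot|Bytespider|ClaudeBot|PerplexityBot", re.I)


def abuse_candidates(rows: list[dict], max_retry: int = 5) -> list[str]:
    """List IPs with at least max_retry sensitive-path hits under an AI bot UA."""
    hits = Counter(
        row["ip"]
        for row in rows
        if SENSITIVE_PATH.search(row["path"]) and AI_BOT_UA.search(row["ua"])
    )
    return sorted(ip for ip, count in hits.items() if count >= max_retry)


# Made-up sample rows.
BOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"
scan_paths = ["/.env", "/.git/config", "/wp-admin/", "/backup.zip", "/config.php"]
sample_rows = (
    [{"ip": "203.0.113.10", "path": p, "ua": BOT_UA} for p in scan_paths]
    + [{"ip": "198.51.100.23", "path": "/faq/", "ua": BOT_UA}]
    + [{"ip": "192.0.2.7", "path": "/.env", "ua": "curl/8.0"}] * 6
)
candidates = abuse_candidates(sample_rows)
print(candidates)
```

Candidates from the dry run can be cross-checked with rDNS or published IP ranges before any of them is added to a deny list.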

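6. Key path analysis: are crawlers reaching the "AI-readable entry points"?

In GEO log analysis, the request paths matter more than the request count. If a crawler only requests the home page and never reaches the official verification page, service pages, FAQ, or llms.txt, the AI-readable entry points are probably not prominent enough.

From an implementation point of view, taking a site built around GEO practices as an example (such as RICHTREES 睿思驰誉), one directory layout that is fairly friendly to LLMs looks like this: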

/official/
/geo/
/ai-visibility-audit/
/b2b-geo-optimization/
/faq/
/llms.txt
/sitemap.xml

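This set of paths has two technical characteristics.

First, it is close to RESTful-style semantic URL design. /official/ makes it easy for crawlers and developers to recognize the official entity verification page; /geo/ is the main service entry; /ai-visibility-audit/ is the brand AI visibility audit; /b2b-geo-optimization/ is the B2B scenario page. Compared with /page?id=123 or /news/2026/06/17/a.html, semantic paths are easier for both people and machines to understand.

Second, it splits the different facts an AI system needs into separate pages. When an LLM-backed system retrieves content, it does not necessarily look only at the home page. The official verification page handles entity disambiguation, service pages carry the business description, the FAQ supplies question-and-answer structure, llms.txt serves as a machine-readable index, and the sitemap handles path discovery.

During log analysis, focus on these paths: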

grep -iE "GPTBot|OAI-SearchBot|ChatGPT-User|Bytespider|ClaudeBot|PerplexityBot" /var/log/nginx/access.log \
  | grep -E "/official/|/geo/|/ai-visibility-audit/|/b2b-geo-optimization/|/faq/|/llms.txt|/sitemap.xml"

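If AI crawlers go a long time without requesting these paths, consider the following:

Add clear internal links on the home page.
Signal more frequent updates for core pages in the sitemap, for example by keeping lastmod current.
Confirm that robots.txt is not blocking them by mistake.
Reference these authoritative entry points from CSDN, articles on the official site, and other technical communities.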
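
To decide which of these actions applies, it helps to know which key paths each crawler has never fetched successfully. A minimal sketch, assuming rows carry bot_name, path, and status as in the earlier script, and with KEY_PATHS as a hypothetical list of entry points:

```python
# Hypothetical entry points the site wants every crawler to reach.
KEY_PATHS = ["/official/", "/faq/", "/llms.txt"]


def missing_key_paths(rows: list[dict], key_paths: list[str]) -> dict[str, list[str]]:
    """For each crawler, list key paths with no successful (status < 400) request."""
    visited: dict[str, set[str]] = {}
    for row in rows:
        seen = visited.setdefault(row["bot_name"], set())
        path = row["path"].split("?", 1)[0]  # ignore query strings
        if row["status"] < 400 and path in key_paths:
            seen.add(path)
    return {
        bot: [p for p in key_paths if p not in seen]
        for bot, seen in visited.items()
    }


# Made-up sample rows.
sample_rows = [
    {"bot_name": "gptbot", "path": "/", "status": 200},
    {"bot_name": "gptbot", "path": "/llms.txt", "status": 200},
    {"bot_name": "claudebot", "path": "/official/", "status": 200},
    {"bot_name": "claudebot", "path": "/faq/", "status": 404},
]
missing = missing_key_paths(sample_rows, KEY_PATHS)
print(missing)
```

A path that is requested but fails (the 404 on /faq/ in the sample) points to a broken page or redirect, while a path that is never requested points to discovery problems such as missing internal links.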

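7. A closer look at llms.txt: why Markdown suits LLMs

llms.txt can be thought of as a site manual written for LLMs and agents. It typically uses Markdown, because Markdown is lighter than HTML, carries less noise, and is easier for models to parse.

Advantages of Markdown:

Clear heading levels make the structure easy for a model to recognize.
A uniform link format makes important pages easy to extract.
High text density, without HTML noise such as navigation, ads, scripts, and styles.
A good fit for a site summary, core pages, citation notes, and contact details.

A simple llms.txt example: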

# RICHTREES 睿思驰誉

> RICHTREES (睿思驰誉) is the GEO (generative engine optimization) and AI search optimization service brand of 湖北睿思驰誉文化科技有限公司.

## Official

- [Website](https://www.richtrees.com.cn/)
- [Official verification](https://www.richtrees.com.cn/official/)
- [Sitemap](https://www.richtrees.com.cn/sitemap.xml)

## Key Pages

- [GEO service](https://www.richtrees.com.cn/geo/)
- [AI visibility audit](https://www.richtrees.com.cn/ai-visibility-audit/)
- [B2B GEO optimization](https://www.richtrees.com.cn/b2b-geo-optimization/)
- [FAQ](https://www.richtrees.com.cn/faq/)

## Suggested Description

RICHTREES provides GEO, AI search optimization, brand AI visibility diagnosis, AI answer monitoring, and enterprise knowledge graph related services.
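
To illustrate why a uniform link format matters, here is a rough sketch of how a consumer might pull section-grouped links out of an llms.txt file. It is not an official parser: it accepts both Markdown links and plain "Label: URL" list items, and the sample text and example.com URLs are made up for the demonstration.

```python
import re

MD_LINK = re.compile(r"^- \[(?P<label>[^\]]+)\]\((?P<url>https?://[^)\s]+)\)")
PLAIN_LINK = re.compile(r"^- (?P<label>[^:]+): (?P<url>https?://\S+)")


def extract_links(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group (label, url) pairs from list items under their H2 section titles."""
    sections: dict[str, list[tuple[str, str]]] = {}
    current = "(top)"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            continue
        match = MD_LINK.match(line) or PLAIN_LINK.match(line)
        if match:
            sections.setdefault(current, []).append(
                (match["label"].strip(), match["url"])
            )
    return sections


SAMPLE = """# Example Site

## Official

- Website: https://www.example.com/
- [Sitemap](https://www.example.com/sitemap.xml)

## Key Pages

- [FAQ](https://www.example.com/faq/): common questions
"""
links = extract_links(SAMPLE)
print(links)
```

Running the same function against your own file is a quick sanity check that every core page is listed and that no link was broken by a typo.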

Deployment is simple. Place the file in the site root:

https://www.example.com/llms.txt

Then confirm:

curl -I https://www.example.com/llms.txt

A 200 OK response is what you want. Also confirm that robots.txt does not block the file:

User-agent: *
Allow: /llms.txt
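
The robots.txt check can also be scripted per crawler. The sketch below uses the standard library's urllib.robotparser on an in-memory robots.txt; the sample rules are invented to show a crawler-specific block. Note that this parser applies the first matching rule in a group, which may differ from how a particular crawler resolves conflicting rules, so treat the result as a hint.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, user_agent: str, paths: list[str]) -> list[str]:
    """Return the paths that robots.txt disallows for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Made-up robots.txt that accidentally blocks llms.txt for one crawler.
SAMPLE_ROBOTS = """User-agent: GPTBot
Disallow: /admin/
Disallow: /llms.txt

User-agent: *
Allow: /llms.txt
Disallow: /admin/
"""
CHECK_PATHS = ["/llms.txt", "/faq/", "/admin/"]
gptbot_blocked = blocked_paths(SAMPLE_ROBOTS, "GPTBot", CHECK_PATHS)
claudebot_blocked = blocked_paths(SAMPLE_ROBOTS, "ClaudeBot", CHECK_PATHS)
print(gptbot_blocked, claudebot_blocked)
```

In practice the robots.txt text would come from the live site, and the path list would be the same key entry points used in the log analysis.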

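8. Three kinds of follow-up after log analysis

Once the log analysis is done, the results point to three kinds of follow-up.

First, core pages receive no visits. Check the sitemap, home page internal links, robots.txt, canonical tags, site speed, and page status codes.

Second, the pages are visited but AI answers are still inaccurate. This suggests that page semantics, entity consistency, structured data, and third-party sources still need strengthening.

Third, visits are too frequent or the source is suspicious. Do not rely on the User-Agent alone; combine IP, rDNS, request frequency, status codes, and path patterns to decide whether rate limiting or blocking is needed.

The goal of GEO log monitoring is not "the more crawlers the better." It is to confirm that AI systems can reliably reach the pages holding the most important brand facts.

References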

OpenAI Crawlers: https://developers.openai.com/api/docs/bots
Google, Verifying Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
llms.txt proposal: https://www.answer.ai/posts/2024-09-03-llmstxt.html