A Roundup of Crawlers Found in Nginx Logs and How to Block Them

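1. Introduction

This post goes through the requests captured in my Nginx logs and lists the crawlers, as well as the attackers, that showed up there.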

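2. Crawlers

The entries below are grouped by the $http_user_agent field of the Nginx log, with a short note on each.

If you are not familiar with Nginx logs, see my article https://zinyan.com/?p=444 for background.

If a crawler requests the robots.txt file, that usually indicates it honors the robots protocol. We can add rules to robots.txt to tell such a crawler that this site must not be crawled.

PS: robots.txt is only a convention and is not enforced.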

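2.1 serpstatbot crawler

UserAgent: serpstatbot/2.1 (advanced backlink tracking bot; https://serpstatbot.com/; abuse@serpstatbot.com)

IP address: 5.9.55.228

This is a crawler run by a foreign company; it collects SEO information about our site.

According to the official documentation, if you do not want serpstatbot to crawl your site, create a robots.txt file in the site root and add the following: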

User-agent: serpstatbot
Disallow: /

After that, this crawler should no longer visit the site.
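To check how a well-behaved crawler would interpret this rule, the sketch below feeds the two lines to Python's standard urllib.robotparser. The helper name is_allowed and the example.com URLs are placeholders introduced for illustration, and real crawlers may implement their matching differently.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule for serpstatbot shown above.
ROBOTS_TXT = """\
User-agent: serpstatbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # serpstatbot is denied everywhere; crawlers not named in the file are unaffected.
    print(is_allowed(ROBOTS_TXT, "serpstatbot", "https://example.com/"))
    print(is_allowed(ROBOTS_TXT, "bingbot", "https://example.com/tags/sdk"))
```

This only models what a compliant crawler is expected to do; it cannot tell whether a given crawler actually obeys the file.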

Official site: https://serpstatbot.com/

2.2 Bing crawler

UserAgent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36

or: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

IP addresses: 157.55.39.80, 157.55.39.201

This is the official crawler of Bing search.

2.3 Alibaba Cloud Situational Awareness

UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.6.2333.33 Safari/537.36 AliyunTaiShiGanZhi https://www.aliyun.com/product/sas

It typically scans for paths such as:

POST /inc/td_core.php HTTP/1.1
GET /inc/td_core.php HTTP/1.1

POST /ispirit/im/upload.php HTTP/1.1
GET /ispirit/im/upload.php HTTP/1.1
POST /ispirit/im/upload.php HTTP/1.1
POST /rapi/filedownload?filter=path:%2fusr%2fshare%2fzoneinfo%2fzone.tab HTTP/1.1
POST /pcidss/report?type=allprofiles&sid=loginchallengeresponse1requestbody&username=nsroot&set=1
GET /pcidss/report?type=allprofiles&sid=loginchallengeresponse1requestbody&username=nsroot&set=1 HTTP/1.1
...

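IP addresses: 47.110.180.50, 47.110.180.32, 47.110.180.55, 47.110.180.42, 47.110.180.40, 47.110.180.47, 47.110.180.52, 47.110.180.60, 47.110.180.46, 47.110.180.38, 47.110.180.63, 47.110.180.57, and so on.

PS: Search results on Baidu claim this is Alibaba Cloud's Situational Awareness scanner, but when I asked through a support ticket, the Alibaba Cloud after-sales engineer told me that the requests above are not Alibaba Cloud's vulnerability scanning and can be blocked.

They can be blocked together by blocking the network segment 47.110.180.0/24.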
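Before adding a network segment to a block list, it is worth checking that the CIDR is well formed (an IPv4 prefix length must be between 0 and 32) and that it really covers the offending addresses. The following is a minimal sketch using Python's standard ipaddress module. It assumes the intended segment is the /24 that contains the addresses listed above, and the names SCANNER_IPS, BLOCKED_NETWORK, is_valid_cidr and is_blocked are made up for this example.

```python
import ipaddress

# Scanner addresses copied from the list above.
SCANNER_IPS = [
    "47.110.180.50", "47.110.180.32", "47.110.180.55", "47.110.180.42",
    "47.110.180.40", "47.110.180.47", "47.110.180.52", "47.110.180.60",
    "47.110.180.46", "47.110.180.38", "47.110.180.63", "47.110.180.57",
]

# Assumption: the segment to block is the /24 containing those addresses.
BLOCKED_NETWORK = ipaddress.ip_network("47.110.180.0/24")


def is_valid_cidr(cidr: str) -> bool:
    """Return True if cidr parses as an IPv4 or IPv6 network."""
    try:
        ipaddress.ip_network(cidr)
    except ValueError:
        return False
    return True


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside BLOCKED_NETWORK."""
    return ipaddress.ip_address(ip) in BLOCKED_NETWORK


if __name__ == "__main__":
    print(all(is_blocked(ip) for ip in SCANNER_IPS))
```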

2.4 Google crawler

UserAgent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) or Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.110 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

IP addresses: 66.249.77.63, 66.249.77.34

2.5 Baidu crawler

UserAgent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

IP addresses: 116.179.32.109, 220.181.108.91, 116.179.37.*

2.6 SeznamBot crawler

UserAgent: Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/) or Mozilla/5.0 (compatible; SeznamBot/4.0-RC1; +http://napoveda.seznam.cz/seznambot-intro/)

IP addresses: 77.75.76.166, 77.75.79.31

Like serpstatbot, this is a foreign crawler and can be blocked.

2.7 YisouSpider crawler

UserAgent: YisouSpider or Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36

IP addresses: 101.67.49.72, 39.173.105.172, 60.188.10.170, 101.67.49.191, 112.13.112.104, 112.13.112.139, and others.

This is the crawler of Shenma search.

2.8 Toutiao crawler

This is the crawler of Jinri Toutiao.

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181;Bytespider;https://zhanzhang.toutiao.com/ or Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)

IP addresses: 110.249.202.37, 111.225.148.58

2.9 PetalBot crawler

This is the crawler of Huawei's Petal Search.

UserAgent: Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)

IP addresses: 114.119.159.226, 114.119.153.191

2.10 YandexBot crawler

This is the crawler of the Russian search engine Yandex.

UserAgent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

IP address: 5.255.253.111

Honors the robots protocol.

2.11 AhrefsBot crawler

A crawler from a foreign marketing site. It can be blocked.

UserAgent: Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)

IP addresses: 51.222.253.2, 51.222.253.6

Honors the robots protocol.

2.12 Expanse scanner

Expanse, a Palo Alto Networks company, scans our server by IP address.

UserAgent: Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com

IP addresses: 205.210.31.132, 198.235.24.29

2.13 Sogou crawler

UserAgent: Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

IP address: 58.250.125.82

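2.14 MJ12Bot crawler

A foreign SEO analysis crawler, similar to SemrushBot. If the site does not serve overseas visitors, it can safely be blocked.

UserAgent: Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)

IP address: 173.212.245.225

PS:

1. The list above covers only some of the crawlers; there are many more that are not shown here.

2. The IP addresses above are for reference only, because IP addresses can change.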

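3. Vulnerability scanning

This section lists some illegitimate crawlers and attack attempts. Decide which ones to block based on your own situation.

3.1 Python attacks

UserAgent: "Python-urllib/3.8"

I call it an attack because the $request values it sends are not normal site URLs but paths like the following: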

GET /remote/fgt_lang?lang=/../../../..//////////dev/cmdb/sslvpn_websession HTTP/1.1

IP address: 185.7.214.218
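Requests like the one above can be picked out of a log by looking for directory-traversal steps in the request target. The function below is only a rough heuristic sketch, not a complete attack detector; the name looks_like_traversal is made up for this example, and it assumes the input has the "METHOD target PROTOCOL" shape of the $request field.

```python
from urllib.parse import unquote


def looks_like_traversal(request_line: str) -> bool:
    """Heuristic: flag request lines whose target contains a '../' step."""
    parts = request_line.split(" ")
    if len(parts) < 2:
        return False  # not a "METHOD target PROTOCOL" request line
    # Decode percent-escapes so that an encoded %2e%2e%2f is caught as well.
    target = unquote(parts[1])
    return "../" in target or target.endswith("/..")


if __name__ == "__main__":
    sample = ("GET /remote/fgt_lang?lang=/../../../..//////////dev/cmdb/"
              "sslvpn_websession HTTP/1.1")
    print(looks_like_traversal(sample))
```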

3.2 Zgrab scanner

UserAgent: Mozilla/5.0 zgrab/0.x

This is the zgrab scanner from the ZMap project, used to quickly grab the responses returned by applications.

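4. Filtering

4.1 Nginx filter configuration

We can use Nginx to block some illegitimate requests as well as some foreign crawlers, which reduces the load on the server.

Simply add the filter rules to the nginx.conf configuration file, for example:

Deny access to any client whose UA contains one of the keywords below or is empty. (The ~* operator matches case-insensitively, so letter case does not matter, but I am in the habit of writing mixed case.)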

# If the User-Agent is empty, return 444 immediately
if ($http_user_agent ~ ^$) {
    return 444;
}
# If the User-Agent contains any of the keywords below, return 444 immediately
if ($http_user_agent ~* "Scrapy|python|curl|wget|httpclient|MJ12bot|Expanse|ahrefsbot|seznambot|serpstatbot|sindresorhus|zgrab") {
    return 444;
}

The match is done by regular expression.
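To try the keyword list against sample User-Agent strings before touching the server, the same pattern can be approximated with Python's re module. This is only a rough sketch: nginx uses PCRE rather than Python's regex engine (the two agree for a plain alternation like this one), and the names BLOCKED_UA and would_block are made up for this example.

```python
import re

# Same keyword alternation as in the nginx rule; re.IGNORECASE mirrors "~*".
BLOCKED_UA = re.compile(
    r"Scrapy|python|curl|wget|httpclient|MJ12bot|Expanse|ahrefsbot"
    r"|seznambot|serpstatbot|sindresorhus|zgrab",
    re.IGNORECASE,
)


def would_block(user_agent: str) -> bool:
    """Approximate the two nginx "if" rules: empty UA or keyword match."""
    if user_agent == "":
        return True  # corresponds to the "~ ^$" rule
    return BLOCKED_UA.search(user_agent) is not None


if __name__ == "__main__":
    for ua in (
        "Python-urllib/3.8",
        "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    ):
        print(would_block(ua), ua)
```

Running real User-Agent strings from the log through would_block is a quick way to notice a keyword that is too broad and would also reject legitimate visitors.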

The configuration above can be placed directly inside server{} or inside location{}.

At one point I ran into the following error:

[root@iZuf conf.d]# nginx -t
nginx: [emerg] unknown directive "if($http_user_agent" in /etc/nginx/nginx.conf:40
nginx: configuration file /etc/nginx/nginx.conf test failed

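A round of searching on Baidu suggested that I might not have the rewrite module installed, which gave me a scare. That seemed impossible, since nothing went wrong while installing nginx.

Besides, the rewrite module is one of nginx's default components, so it should not have been missing.

In the end I found that a space is required after if. In other words, you cannot write if($http_user_agent... directly; it has to be if ($http_user_agent.... Without the space, nginx reads if($http_user_agent as a single unknown directive name, which is exactly what the error message reports.

This problem is simply exasperating. For more configuration options, see the documentation on the nginx website: http://nginx.org/en/docs/http/ngx_http_rewrite_module.html

Once nginx -t passes, reload the configuration with service nginx reload and the rules take effect.

For example, after configuring this I can see the following in my log (PS: the log format is described in https://zinyan.com/?p=444):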

Status:444,Bytes:0,IP:135.181.74.243,Time:[2022-11-18T04:32:04+08:00],Request:"GET /robots.txt HTTP/1.1" ,Referer:"-",UserAgent:"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",RequestTime:[0.000]

Status:444,Bytes:0,IP:51.222.253.13,Time:[2022-11-18T03:36:57+08:00],Request:"GET /tags/sdk HTTP/1.1" ,Referer:"-",UserAgent:"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",RequestTime:[0.000]

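As you can see, the status is 444 (a non-standard nginx code that closes the connection without sending a response) and the Bytes value returned is 0, which shows that the blocking rule works.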
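To see at a glance which clients are being turned away, the 444 entries can be tallied per User-Agent. The sketch below assumes the custom log format of the two sample lines above; the names LOG_LINE and count_blocked are made up for this example, and lines that do not match the format are simply skipped.

```python
import re
from collections import Counter

# Matches the custom log format shown in the two sample lines above.
LOG_LINE = re.compile(
    r'Status:(?P<status>\d{3}),Bytes:(?P<bytes>\d+),IP:(?P<ip>[^,]+),'
    r'Time:\[(?P<time>[^\]]+)\],Request:"(?P<request>[^"]*)"\s*,'
    r'Referer:"(?P<referer>[^"]*)",UserAgent:"(?P<ua>[^"]*)",'
    r'RequestTime:\[(?P<request_time>[^\]]+)\]'
)


def count_blocked(lines):
    """Count requests answered with 444, keyed by User-Agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "444":
            counts[match.group("ua")] += 1
    return counts
```

Since count_blocked accepts any iterable of lines, an open log file can be passed to it directly; the path of that file depends on your own Nginx configuration.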

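4.2 IP blocking

Taking Alibaba Cloud as an example, you can add rules to the security group of the ECS instance to deny access from specific IPs. Once a rule is added, that IP can no longer reach our server, and therefore cannot visit the site or launch attacks.

The same method can be used to block some foreign IPs.

In fact, for an ordinary blog, blocking or not does not make much difference. Anyone who really wants to attack can easily get around the blocks above. The main point is to filter out bogus requests.

It blocks scanning tools and foreign SEO crawlers and reduces the request load a little, nothing more.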
