利用Simhash做URL去重的实现方式

利用Simhash做URL去重的实现方式 – Hi Faye Open Menu

利用Simhash做URL去重的实现方式

14 Jul 2017

Reading time ~2 minutes

去年写过一个爬虫,给定一个域名后,递归爬取该域名及其子域名下的所有URL,这在渗透测试时,对于了解该目标的资产信息有很大帮助,可以爬到很多在线使用的业务等。所有这类的爬虫大概都会遇到一个问题,就是URL去重,大量相似的URL,全部抓取非常耗费资源,并且没有价值。去年没有解决这个问题,现在在做IDS的URL分析时,想到了一种解决这个问题的思路,所以补充了一下代码,代码地址: https://github.com/nobleXu/TinyWebSpider

实现思路

爬虫的环境是很复杂的,你完全不知道到底有多少URL,也不知道你爬到的URL会有多少种格式。所以没有办法对所有URL做特定的分类和分析,只能写一些通用的规则,代码是去年写的了,比较乱,主要加了一个URL比较的函数。下面具体说一下我是如何对IDS抓到的URL做去重的,因为这些URL相对来说会有规律的多。

简单的一个例子,下面我贴出一个域名下的一些URL,并一步步阐述对他们做去重的具体步骤。

拿去哪网的几个URL为例:

http://tuan.qunar.com/?in_track=home_tuan_content&list=gengduo
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_50dao100
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_100dao150
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_150dao200
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_200dao500
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_dayu500
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_yibai
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_buding
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_99liansuo
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_weiyena
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_jinjiangzhixing
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_gelinhaotai
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_jitai
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_su8
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_quanjijiudian
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_weiyenaguoji
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_guojiqingnianlvshe
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_rujiaheyi
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_yibisi
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_fuhao
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_jingjixing
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_shushixing
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_zhutijiudian
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_gaodanghaohua
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_liansuopinpai
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_dujiajiudian
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_gongyuxingjiudian
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_qingnianlvshe
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_kezhan
http://tuan.qunar.com/ext/sact/RjABVv?in_track=home_tuan_content_lunbo
http://tuan.qunar.com/hotel/shanghai_city_21580?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_16650?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_7888?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_1991?in_track=home_tuan_content&list=rexiaojingxuan

第一步:泛化

解析URL的每个参数,把每个参数的值做泛化。具体代码:

self.Chars = [',', '-', '_']
def etl(self, str):
        chars = ""
        for c in str:
            c = c.lower()
            if ord('a') <= ord(c) <= ord('z'):
                chars += 'A'
            elif ord('0') <= ord(c) <= ord('9'):
                chars += 'N'
            elif c in self.Chars:
                chars += 'T'
            else:
                chars += 'C'
        return chars

Chars 指定了一些参数内可能出现的符号。传入一个字符串,将里面的字母转化为A,数字转化为N,特殊符号转换为T,其他符号或者字符转化成C。 上述URL泛化之后的结果如下:

http://tuan.qunar.com/?in_track=home_tuan_content&list=gengduo
http://tuan.qunar.com/?list=AAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?tag=AAAAATAAAAAANN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_50dao100
http://tuan.qunar.com/?tag=AAAAATNNAAANNN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_100dao150
http://tuan.qunar.com/?tag=AAAAATNNNAAANNN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_150dao200
http://tuan.qunar.com/?tag=AAAAATNNNAAANNN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_200dao500
http://tuan.qunar.com/?tag=AAAAATNNNAAANNN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_dayu500
http://tuan.qunar.com/?tag=AAAAATAAAANNN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_yibai
http://tuan.qunar.com/?tag=AAAAAATAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_buding
http://tuan.qunar.com/?tag=AAAAAATAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_99liansuo
http://tuan.qunar.com/?tag=AAAAAATNNAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_weiyena
http://tuan.qunar.com/?tag=AAAAAATAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_jinjiangzhixing
http://tuan.qunar.com/?tag=AAAAAATAAAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_gelinhaotai
http://tuan.qunar.com/?tag=AAAAAATAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_jitai
http://tuan.qunar.com/?tag=AAAAAATAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_su8
http://tuan.qunar.com/?tag=AAAAAATAAN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_quanjijiudian
http://tuan.qunar.com/?tag=AAAAAATAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_weiyenaguoji
http://tuan.qunar.com/?tag=AAAAAATAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_guojiqingnianlvshe
http://tuan.qunar.com/?tag=AAAAAATAAAAAAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_rujiaheyi
http://tuan.qunar.com/?tag=AAAAAATAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_yibisi
http://tuan.qunar.com/?tag=AAAAAATAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_fuhao
http://tuan.qunar.com/?tag=AAAAAATAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_jingjixing
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_shushixing
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_zhutijiudian
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_gaodanghaohua
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_liansuopinpai
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_dujiajiudian
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_gongyuxingjiudian
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_qingnianlvshe
http://tuan.qunar.com/?tag=AAAAAAATAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=leixing_kezhan
http://tuan.qunar.com/?tag=AAAAAAATAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/ext/sact/RjABVv?in_track=home_tuan_content_lunbo
http://tuan.qunar.com/ext/sact/RjABVv?in_track=AAAATAAAATAAAAAAATAAAAA
http://tuan.qunar.com/hotel/shanghai_city_21580?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_21580?list=AAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/hotel/shanghai_city_16650?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_16650?list=AAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/hotel/shanghai_city_7888?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_7888?list=AAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/hotel/shanghai_city_1991?in_track=home_tuan_content&list=rexiaojingxuan
http://tuan.qunar.com/hotel/shanghai_city_1991?list=AAAAAAAAAAAAAA&in_track=AAAATAAAATAAAAAAA

可以看到,通过泛化之后,已经可以去掉一部分重复的URL了。前段时间看到携程的一篇技术分享,写的是他们内部扫描器的实现,URL去重就是做了泛化实现的。但是泛化的结果有的确实很相似了,但是却并不是完全一样,这时候Simhash的作用就显现了。

第二步: Simhash

Simhash是Google处理网页去重的算法。网上有很多关于这个算法的说明。下面看一下Python代码的实现方式。前人已经写好了python的接口,直接pip install simhash 就可以安装使用了。 Github地址 文档地址

算法也不复杂,不想安装的同学自己写一下算法也可以,代码:

def build_by_features(self, features):
        """
        `features` might be a list of unweighted tokens (a weight of 1
                   will be assumed), a list of (token, weight) tuples or
                   a token -> weight dict.
        """
        v = [0] * self.f
        masks = [1 << i for i in range(self.f)]
        if isinstance(features, dict):
            features = features.items()
        for f in features:
            if isinstance(f, basestring):
                h = self.hashfunc(f.encode('utf-8'))
                w = 1
            else:
                assert isinstance(f, collections.Iterable)
                h = self.hashfunc(f[0].encode('utf-8'))
                w = f[1]
            for i in range(self.f):
                v[i] += w if h & masks[i] else -w
        ans = 0
        for i in range(self.f):
            if v[i] >= 0:
                ans |= masks[i]
        self.value = ans

计算汉明距离代码:

def distance(self, another):
        assert self.f == another.f
        x = (self.value ^ another.value) & ((1 << self.f) - 1)
        ans = 0
        while x:
            ans += 1
            x &= x - 1
        return ans

其实Simhash的作用简单来说就是判断两个URL是否相似,如果汉明距离在一定范围内,就可判断两个URL相似。这个距离在我做IDS的调试时,采用大量的样本做了训练测试,设置在0-2的范围内。在爬虫时可以指定大一点,也无不可。

上述URL执行Simhash去重之后,结果如下:

http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?in_track=home_tuan_content&tag=pinpai_jinjiangzhixing
http://tuan.qunar.com/ext/sact/RjABVv?in_track=home_tuan_content_lunbo

可以看到剩下的URL,参数差异都是很大的,已经去掉了很大一部分重复的了。剩下的三个URL,区别都是很大的,效果非常显著。

一些问题

利用IDS对URL作分析时,由于我有大量的样本,所以可以根据相应的URL做特定的策略,做一些指定规则的泛化。但是在爬虫时,很多地方没那么灵活,所以有一些问题还是很棘手的。比如下面:

http://overseas.pinganfang.com/list/cn10-cy49
http://overseas.pinganfang.com/list/cn10-cy59
http://overseas.pinganfang.com/list/cn10-cy68
http://overseas.pinganfang.com/list/cn10-cy69
http://overseas.pinganfang.com/list/cn10-cy70
http://overseas.pinganfang.com/list/cn10-cy71
http://overseas.pinganfang.com/list/cn10-cy74
http://overseas.pinganfang.com/list/cn10-cy97
http://overseas.pinganfang.com/list/cn10-cy103
http://overseas.pinganfang.com/list/cn10-cy113
http://overseas.pinganfang.com/list/cn10-cy114
http://overseas.pinganfang.com/list/cn10-cy117

再比如:

http://xf.pinganfang.com/sh/list/rg2-pe7
http://xf.pinganfang.com/sh/list/rg43-pe2
http://xf.pinganfang.com/sh/list/rg2-lo1
http://xf.pinganfang.com/sh/list/rg43-pe3
http://xf.pinganfang.com/sh/list/rg2-lo3
http://xf.pinganfang.com/sh/list/rg43-pe5
http://xf.pinganfang.com/sh/list/rg14-tg21
http://xf.pinganfang.com/sh/list/rg14-pt1
http://xf.pinganfang.com/sh/list/rg2-lo5
http://xf.pinganfang.com/sh/list/rg14-pt2
http://xf.pinganfang.com/sh/list/rg2-lo6
http://xf.pinganfang.com/sh/list/rg43-lo1
http://xf.pinganfang.com/sh/list/rg2-tg1
http://xf.pinganfang.com/sh/list/rg43-lo2
http://xf.pinganfang.com/sh/list/rg2-tg2
http://xf.pinganfang.com/sh/list/rg43-lo6
http://xf.pinganfang.com/sh/list/rg2-tg4
http://xf.pinganfang.com/sh/list/rg43-tg1
http://xf.pinganfang.com/sh/list/rg14-ct1
http://xf.pinganfang.com/sh/list/rg43-tg2
http://xf.pinganfang.com/sh/list/rg14-st3
http://xf.pinganfang.com/sh/list/rg2-tg7
http://xf.pinganfang.com/sh/list/rg14-st1
http://xf.pinganfang.com/sh/list/rg43-tg4
http://xf.pinganfang.com/sh/list/rg2-tg8
http://xf.pinganfang.com/sh/list/rg2-tg9

再比如很常见的:

http://xf.pinganfang.com/sh/main/detail.id.312731.html
http://xf.pinganfang.com/sh/main/detail.id.312732.html
http://xf.pinganfang.com/sh/main/detail.id.312733.html
http://xf.pinganfang.com/sh/main/detail.id.312734.html
http://xf.pinganfang.com/sh/main/detail.id.312735.html

上述三个页面都有一个特点,就是URL的path层级比较多,都大于等于3,还有就是相似点都在path的最后一级上。所以我们可以对这种URL的path做分割,最后一层做泛化,再做Simhash去重来解决这个问题。

不过在爬虫实现时,我没有采用上述方式,而是直接粗暴了忽略了所有以.html结尾的页面。因为我觉得这些页面大多是静态页面,存在问题的可能性比较小,而且展示的信息对我没什么做信息搜集的作用不大。

总结

Simhash算法应用起来很简单,但是在大数据情况下的应用还是有局限的。比如URL多了之后,那就需要计算出Simhash值之后,对列表内的所有URL计算一次汉明距离,非常影响效率。不过我在IDS实现时,是针对每个path下的不同URL存了一个列表,所以不存在上述问题。目前,我已经成功实现从IDS分析URL信息,并且把这些信息导入到资产管理系统中,建立了一个IP、端口、域名、URL、服务、责任人的立体资产管理系统。后续会配合我的Emscan,不仅定时扫描端口等任务,并且定时扫描我导入的URL,等完全实现了就是一套非常立体,完整的资产漏洞管理系统了。 另外,这周基于隐式马尔科夫模型搭建了一个异常URL检测的系统,后期也会分享出来的。