Paper Title
PhishSim: Aiding Phishing Website Detection with a Feature-Free Tool
Paper Authors
Abstract
In this paper, we propose a feature-free method for detecting phishing websites using the Normalized Compression Distance (NCD), a parameter-free similarity measure that computes the similarity of two websites by compressing them, eliminating both the need for feature extraction and any dependence on a specific set of website features. The method examines the HTML of a webpage and computes its similarity to known phishing websites in order to classify it. We use the Furthest Point First algorithm to extract phishing prototypes, selecting instances that are representative of a cluster of phishing webpages. We also introduce an incremental learning algorithm as a framework for continuous and adaptive detection that does not require extracting new features when concept drift occurs. On a large dataset, our proposed method significantly outperforms previous methods in detecting phishing websites, achieving an AUC score of 98.68% and a true positive rate (TPR) of around 90% while maintaining a low false positive rate (FPR) of 0.58%. Because our approach uses prototypes, it eliminates the need to retain long-term data, and it is feasible to deploy in real systems, with a processing time of roughly 0.3 seconds.
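To make the two building blocks named in the abstract concrete, the following is a minimal sketch of NCD and of greedy Furthest Point First prototype selection. It is not the authors' implementation. The abstract does not name a compressor, so `zlib` is used here as an assumption, and the function names (`ncd`, `furthest_point_first`, `min_distance_to_prototypes`) are hypothetical. The sketch uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest similar content; values near 1 suggest unrelated
    content. With a real compressor, ncd(x, x) is small but not exactly 0.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def furthest_point_first(pages: list[bytes], k: int) -> list[int]:
    """Greedy FPF: return indices of up to k pages chosen as prototypes.

    Starts from the first page (an arbitrary choice made for this sketch),
    then repeatedly picks the page whose distance to its nearest
    already-chosen prototype is largest.
    """
    if not pages or k <= 0:
        return []
    chosen = [0]
    # Distance from every page to its nearest chosen prototype so far.
    dist = [ncd(pages[0], p) for p in pages]
    dist[0] = -1.0  # mark as chosen so it is never picked again
    while len(chosen) < min(k, len(pages)):
        nxt = max(range(len(pages)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, ncd(pages[nxt], p)) for d, p in zip(dist, pages)]
        for i in chosen:
            dist[i] = -1.0
    return chosen


def min_distance_to_prototypes(page: bytes, prototypes: list[bytes]) -> float:
    """Smallest NCD from a page's HTML to any phishing prototype."""
    return min(ncd(page, p) for p in prototypes)
```

A page's HTML could then be scored with `min_distance_to_prototypes(html, prototypes)` and flagged when the score falls below a threshold. The abstract gives no threshold or decision rule, so that step would have to be tuned on labeled data.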