Paper Title
Towards Web Phishing Detection Limitations and Mitigation
Paper Authors
Paper Abstract
Web phishing remains a serious cyber threat responsible for most data breaches. Machine Learning (ML)-based anti-phishing detectors are seen as an effective countermeasure and are increasingly adopted by web browsers and software products. However, with an average of 10K phishing links reported per hour to platforms such as PhishTank and VirusTotal (VT), the deficiencies of such ML-based solutions are laid bare. We first explore how phishing sites bypass ML-based detection with a deep dive into 13K phishing pages targeting major brands such as Facebook. Results show successful evasion is caused by: (1) use of benign services to obscure phishing URLs; (2) high similarity between the HTML structures of phishing and benign pages; (3) hiding the ultimate phishing content within JavaScript and running such scripts only on the client; (4) harvesting new content such as IDs and documents, beyond typical credentials and credit cards; (5) hiding phishing content until after human interaction. We attribute the root cause to the dependency of ML-based models on the vertical feature space (webpage content): these solutions rely only on what phishers present within the page itself. Thus, we propose Anti-SubtlePhish, a more resilient model based on logistic regression. The key augmentation is the inclusion of a horizontal feature space, which examines correlation variables between the final render of a suspicious page and what trusted services have recorded about it (e.g., PageRank). To defeat (1) and (2), we correlate information between WHOIS, PageRank, and page analytics. To combat (3), (4), and (5), we correlate features after rendering the page. Experiments with 100K phishing/benign sites show promising accuracy (98.8%). We also obtained 100% accuracy against manually crafted 0-day phishing pages, compared with the 0% recorded by VT vendors over the first four days.
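For illustration only, below is a minimal Python sketch (assuming scikit-learn) of the horizontal-feature idea the abstract describes: correlate what a suspicious page presents against what trusted services record (WHOIS registration age, PageRank, page analytics ownership), together with measurements taken after the page is fully rendered, and feed the resulting variables to a logistic-regression classifier. Every feature name and helper here is hypothetical; the abstract does not disclose Anti-SubtlePhish's actual feature set or pipeline.

# Hypothetical sketch of a horizontal feature space for phishing detection:
# cross-check a page against trusted external records, then classify with
# logistic regression. All names below are illustrative assumptions.
from dataclasses import dataclass
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

@dataclass
class PageRecord:
    domain_age_days: int            # from WHOIS: newly registered domains are suspect
    pagerank_score: float           # from a PageRank-style reputation service
    analytics_id_match: bool        # does the analytics ID belong to the imitated brand?
    rendered_form_count: int        # counted AFTER JavaScript runs (defeats client-side hiding)
    rendered_vs_static_diff: float  # fraction of content that appears only post-render

def to_features(p: PageRecord) -> list[float]:
    # Flatten the cross-service correlation variables into a feature vector.
    return [
        float(p.domain_age_days),
        p.pagerank_score,
        float(p.analytics_id_match),
        float(p.rendered_form_count),
        p.rendered_vs_static_diff,
    ]

def train(records: list[PageRecord], labels: list[int]) -> LogisticRegression:
    # labels: 1 = phishing, 0 = benign
    X = [to_features(r) for r in records]
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print(f"holdout accuracy: {clf.score(X_test, y_test):.3f}")
    return clf

One appeal of logistic regression in such a design is interpretability: each learned coefficient indicates how strongly one cross-service correlation pushes a page toward a phishing verdict, which helps explain why a page was blocked.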