Malicious Webpage Detection Based on Feature Fusion Using Natural Language Processing and Machine Learning

G, Pradeepa and R, Devi (2023) Malicious Webpage Detection Based on Feature Fusion Using Natural Language Processing and Machine Learning. In: 2023 2nd International Conference on Edge Computing and Applications (ICECAA), Namakkal, India.


Abstract

Malicious websites are purposefully designed to deceive internet users in order to steal sensitive personal information, infect the victim's system with malware, cause financial losses, and damage the victim's reputation. Such pages and links are hard for internet users to recognize, so detection tools are used to find them. Most detection techniques rely on blacklists or whitelists to identify and block malicious websites. However, compiling such a large list of website links is time-consuming, and the list is difficult to keep up to date. Researchers therefore employ machine learning-based methods to identify malicious links. These methods rely on features extracted from URLs or web pages, sometimes supplemented by DNS details, webpage reputation, and visual similarity data. However, these features are few and do not fully exploit the URLs or the website contents. This work merges URL lexical features with content-based features for malicious webpage detection in order to make fuller use of the dataset. Natural language processing methods, namely the Hashing, Count, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorizers, are employed to extract features from webpage content. The effectiveness of the proposed approach is evaluated using several well-known machine learning methods. The results show that the Count vectorizer with Random Forest achieves the highest accuracy, 91.17%, with 500 features.
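
The feature fusion described in the abstract can be illustrated with a minimal sketch, using only the Python standard library. It is not the authors' implementation. The abstract does not list the URL lexical features, so the six features in `url_lexical_features` are assumptions chosen for illustration, as are the helper names, the toy pages, and the tokenization rule. The count-based vocabulary is capped at 500 terms to mirror the feature count reported in the abstract.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_lexical_features(url):
    """Illustrative lexical features (assumed; not the paper's feature list)."""
    parsed = urlparse(url)
    return [
        len(url),                              # total URL length
        len(parsed.netloc),                    # hostname length
        url.count("."),                        # number of dots
        url.count("-"),                        # number of hyphens
        sum(ch.isdigit() for ch in url),       # number of digits
        1 if parsed.scheme == "https" else 0,  # uses HTTPS
    ]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(documents, max_features=500):
    """Keep the max_features most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # Sort by descending frequency, then alphabetically for determinism.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    return {term: idx for idx, (term, _) in enumerate(top)}


def count_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    vec = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


def fuse(url, content, vocabulary):
    """Concatenate URL lexical features with content count features."""
    return url_lexical_features(url) + count_vector(content, vocabulary)


# Toy (url, page text) pairs, invented for illustration only.
pages = [
    ("https://example.com/about", "Welcome to our company about page"),
    ("http://login-verify.example.net/acc0unt", "Verify your account password now"),
]
vocab = fit_vocabulary([content for _, content in pages], max_features=500)
X = [fuse(url, content, vocab) for url, content in pages]
```

Each row of `X` is one fused feature vector. In a fuller pipeline, such rows would be passed to a classifier; scikit-learn's `CountVectorizer(max_features=500)` and `RandomForestClassifier` could plausibly fill the vectorizer and classifier roles, although the abstract does not state which library was used.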

Item Type: Conference or Workshop Item (Paper)
Subjects: Computer Applications > Database Management System
Divisions: Computer Applications
Depositing User: Mr IR Admin
Date Deposited: 23 Sep 2024 08:16
Last Modified: 23 Sep 2024 08:16
URI: https://ir.vistas.ac.in/id/eprint/6920
