Draft:Imbalanced datasets in malware detection



In cybersecurity, imbalanced datasets pose a major challenge for training machine learning and deep learning models to detect malware.[1] In real-world security environments, malicious samples make up a very small proportion of observed data compared to benign samples, ranging from 0.01% to 2%.[2] This imbalance can bias traditional classifiers towards the majority (benign) class, so that they achieve high overall accuracy while failing to identify malicious samples.[1]

Problem


Traditional machine learning models trained on imbalanced datasets tend to favor the majority class, which results in poor precision and recall on the malware class.[1][2]
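The effect can be illustrated with a minimal sketch. It assumes a hypothetical dataset of 10,000 samples in which 0.5% are malware, and a degenerate classifier that labels every sample as benign; the numbers and names are illustrative and are not taken from the cited studies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive (malware) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Hypothetical data: 50 malware samples (label 1) among 10,000 (0.5%).
y_true = [1] * 50 + [0] * 9950
# Degenerate classifier: predicts "benign" (label 0) for everything.
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Precision is undefined when nothing is predicted positive; report 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy is high only because benign samples dominate; recall on the malware class is zero, which is why accuracy alone is a misleading metric on such data.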

Approaches


Before transformer-based solutions, several methods were examined for classifying software samples under class imbalance. These include sequence-based long short-term memory (LSTM) models and statistical approaches such as n-gram language models. These approaches work well when the dataset is balanced, but their performance drops quickly when malware samples are present in realistic proportions.[2]
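As a rough illustration of the statistical approach, the sketch below trains one bigram (2-gram) model per class with add-one smoothing and compares the log-likelihood of a new sequence under each model. The activity names, the tiny training set, and the helper functions are hypothetical; this is not the implementation used in the cited work, and it ignores class priors.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count context tokens and bigrams, with a start marker before each sequence."""
    contexts, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams


def log_likelihood(seq, model, vocab_size):
    """Log-probability of a sequence under a bigram model with add-one smoothing."""
    contexts, bigrams = model
    padded = ["<s>"] + list(seq)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log(prob)
    return total


# Hypothetical activity sequences for each class.
benign = [["open", "ui", "close"], ["open", "ui", "ui", "close"]]
malware = [["open", "sms", "sms", "close"]]
vocab_size = len({tok for seq in benign + malware for tok in seq})

benign_model = train_bigram(benign)
malware_model = train_bigram(malware)

sample = ["open", "sms", "close"]
ll_benign = log_likelihood(sample, benign_model, vocab_size)
ll_malware = log_likelihood(sample, malware_model, vocab_size)
label = "malware" if ll_malware > ll_benign else "benign"
print(label)
```

With realistic class proportions, the malware model is estimated from very few sequences, which is one plausible reason such models degrade on imbalanced data.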

BERT-based solution


Recent research has explored the use of BERT, a language model originally developed for natural language processing, to address highly imbalanced datasets in malware detection.[2][3] By treating application activity sequences as natural language data, BERT-based methods have reported improved performance. One study found that BERT achieved an F1 score of 0.919 on datasets with only 0.5% malware samples, significantly outperforming traditional approaches.[2]
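The F1 score is the harmonic mean of precision and recall on the positive (malware) class, so it is not inflated by the large number of correctly classified benign samples. A minimal sketch of the calculation follows; the confusion counts are hypothetical and are not the results of the cited study.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical detector: 40 malware samples caught, 10 false alarms, 10 missed.
f1_detector = f1_from_counts(tp=40, fp=10, fn=10)
# A classifier that never flags malware has zero F1, however accurate it looks.
f1_always_benign = f1_from_counts(tp=0, fp=0, fn=50)

print(f"detector F1={f1_detector:.3f}, always-benign F1={f1_always_benign:.3f}")
```

True negatives do not appear in the formula, which is what makes the metric suitable for data where benign samples vastly outnumber malicious ones.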

This approach works by:

  • Analyzing sequences of application activities rather than individual features
  • Using BERT's pre-trained language model capabilities
  • Fine-tuning on Android activity sequence data
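The idea of treating an activity sequence as a sentence can be sketched as follows. This is a simplified, word-level encoding written for illustration: the activity names are hypothetical, the helper functions are not part of any library, and an actual BERT pipeline would use its own tokenizer and vocabulary.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab(sequences):
    """Map each special token and each distinct activity name to an integer id."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for seq in sequences:
        for activity in seq:
            if activity not in vocab:
                vocab[activity] = len(vocab)
    return vocab


def encode(seq, vocab, max_len=8):
    """Wrap a sequence in [CLS]/[SEP], convert to ids, and pad to max_len."""
    tokens = ["[CLS]"] + list(seq)[: max_len - 2] + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)  # 1 marks real tokens, 0 marks padding
    pad = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad


# Hypothetical activity sequences, one "sentence" per application session.
sequences = [
    ["app_open", "read_contacts", "send_sms", "app_close"],
    ["app_open", "load_ui", "app_close"],
]
vocab = build_vocab(sequences)
ids, mask = encode(sequences[0], vocab)
print(ids, mask)
```

The resulting id and mask lists have the same general shape as the inputs a BERT-style sequence classifier expects, with one label (benign or malicious) per sequence during fine-tuning.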

This method addresses a fundamental problem with resampling techniques such as oversampling and undersampling when they are applied to cybersecurity data, where malicious samples are extremely rare.[2]
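For comparison, random undersampling and random oversampling can be sketched as below. The sample counts, the function names, and the fixed random seed are assumptions made for illustration only.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep all minority samples and an equally sized random subset of the majority."""
    return rng.sample(majority, len(minority)) + list(minority)


def random_oversample(majority, minority, rng):
    """Keep everything and duplicate random minority samples until classes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical dataset: 995 benign and 5 malware samples (0.5% malware).
benign = [("benign", i) for i in range(995)]
malware = [("malware", i) for i in range(5)]

under = random_undersample(benign, malware, rng)
over = random_oversample(benign, malware, rng)

print(len(under), len(over))
```

At this ratio, undersampling discards 990 of the 995 benign samples, while oversampling repeats the same 5 malware samples many times. Both trade-offs become more severe as malware becomes rarer.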

References

  1. ^ a b c Almajed, Hussain; Alsaqer, Abdulrahman; Frikha, Mounir (2025). "Imbalance Datasets in Malware Detection: A Review of Current Solutions and Future Directions". International Journal of Advanced Computer Science and Applications. 16 (1). doi:10.14569/IJACSA.2025.01601126.
  2. ^ a b c d e f Oak, Rajvardhan; Du, Min; Yan, David; Takawale, Harshvardhan; Amit, Idan (11 November 2019). "Malware Detection on Highly Imbalanced Data through Sequence Modeling". Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. ACM. pp. 37–48. doi:10.1145/3338501.3357374. ISBN 978-1-4503-6833-9.
  3. ^ Demirkıran, Ferhat; Çayır, Aykut; Ünal, Uğur; Dağ, Hasan (22 June 2022). "An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification". arXiv:2112.13236.