THE INFLUENCE OF DATA PREPROCESSING ON THE PERFORMANCE OF THE RANDOM FOREST MODEL IN DETECTING NETWORK ATTACKS

Maxim PRODEUS; Andrii NICHEPORUK; Oleg IVANCHENKO

doi:10.31891/2219-9365-2024-80-51

Authors

Maxim PRODEUS, Khmelnytskyi National University, https://orcid.org/0009-0002-2968-4648
Andrii NICHEPORUK, Khmelnytskyi National University, https://orcid.org/0000-0002-7230-9475
Oleg IVANCHENKO, Dnipro University of Technology, https://orcid.org/0000-0002-5921-5757

Keywords:

Random Forest, data handling, standardization, feature selection, normalization, imputation, PCA, machine learning, performance metrics, data preprocessing

Abstract

The performance of machine learning models depends strongly on the quality of the data preprocessing applied. Effective preprocessing improves the efficiency of learning algorithms and helps models generalize to unseen data. This research examines how different preprocessing strategies affect the efficiency and predictive performance of the Random Forest model, which is widely used because of its robustness and its ability to handle complex, high-dimensional datasets.
In this study, we analyze the impact of several preprocessing methods: standardization, normalization, handling of missing data, and feature selection. Standardization and normalization matter when features have different scales, because they help keep the contribution of each feature balanced and so prevent bias in the model's learning process. Handling missing data is equally important, since improper handling can introduce noise, reduce data quality, and significantly degrade model performance. Feature selection, in turn, reduces overfitting, improves model interpretability, and lowers computational cost by identifying the most relevant variables.
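As a rough illustration of three of the preprocessing steps named above, the following minimal sketch applies mean imputation, z-score standardization, and min-max normalization to a single numeric feature. It uses only the Python standard library. The function names and the `packet_sizes` column are hypothetical and are not taken from the study, and mean imputation is only one of several possible ways to handle missing values.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std. dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing entry.
packet_sizes = [40, None, 1500, 60]

filled = impute_mean(packet_sizes)       # missing value replaced by the mean
z_scores = standardize(filled)           # zero mean, unit variance
unit_range = min_max_normalize(filled)   # smallest value -> 0, largest -> 1
```

In practice the imputation value and the scaling parameters would be computed on the training split only and then reused on the test split, so that no information leaks from test data into preprocessing.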
To evaluate these techniques, we use a comprehensive dataset and systematically compare the Random Forest model's results under each preprocessing approach. Accuracy, precision, recall, and F1-score are used to assess the effectiveness of each method. Our results show that standardization and feature importance ranking significantly improve model performance by enhancing data consistency and focusing the model on the most informative features. Conversely, poor handling of missing data leads to substantial performance degradation, which highlights the model's sensitivity to data quality issues.
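For reference, the four metrics named above can be computed from the counts of a binary confusion matrix. The sketch below assumes a binary setting in which label 1 means "attack" and label 0 means "normal traffic". The function name and the example labels are hypothetical and do not reproduce any result from the study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = attack, 0 = normal traffic.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
```

Here there are 3 true positives, 2 false positives, 1 false negative, and 2 true negatives, so accuracy is 5/8, precision is 3/5, recall is 3/4, and F1 is 2/3. Reporting precision and recall alongside accuracy matters for intrusion data, where attack and normal classes are often imbalanced.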
These findings underscore the essential role of effective data preprocessing in refining Random Forest models. They offer practical guidance for machine learning practitioners and emphasize the need for careful data preparation. This research contributes to a deeper understanding of how preprocessing choices can lead to more accurate, reliable, and robust machine learning models across application domains.

THE INFLUENCE OF DATA PREPROCESSING ON THE PERFORMANCE OF THE RANDOM FOREST MODEL IN DETECTING NETWORK ATTACKS
