ANOMALY DETECTION IN LARGE-SCALE DATA USING ISOLATION FOREST AND AUTOENCODER
Abstract and keywords
Abstract (English):
Purpose: The study compares two anomaly detection methods applied to large-scale datasets: the ensemble-based Isolation Forest and the neural network-based Autoencoder. Methods: Both algorithms were modelled and evaluated empirically on a real credit card transaction dataset. Standard performance metrics (precision, recall, F1-score, and ROC-AUC) were used, together with a confusion matrix to examine false positives and missed anomalies. Results: Both models achieved high ROC-AUC scores, indicating that they reliably separate typical transactions from anomalous ones. Practical significance: The methods are suitable for automated monitoring of transaction flows, fraud prevention, and the analysis of large datasets. Combining Isolation Forest and Autoencoder in a hybrid system demonstrated superior effectiveness, improving detection accuracy while reducing false positives.

Keywords:
large datasets, anomalies, machine learning, neural network, Autoencoder, Isolation Forest, transaction data
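
The evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the labels, anomaly scores, and the 0.5 threshold below are hypothetical toy values chosen only to show how a confusion matrix, precision, recall, F1-score, and ROC-AUC relate to one another. Label 1 is assumed to mark an anomalous transaction, and a higher score is assumed to mean "more anomalous".

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the anomaly class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed anomalies
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random anomaly outscores a
    random normal point (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six normal transactions and two anomalies.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.15, 0.70, 0.30, 0.25, 0.90, 0.60]

threshold = 0.5  # assumed cut-off; in practice it is tuned on validation data
y_pred = [1 if s >= threshold else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
auc = roc_auc(y_true, scores)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F1={f1:.3f} ROC-AUC={auc:.3f}")
```

Precision, recall, and F1 depend on the chosen threshold, while ROC-AUC is computed from the raw scores alone. This is one reason a high ROC-AUC is usually read alongside the confusion matrix when anomalies are rare.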