Abstract and keywords
Abstract:
This paper examines the architecture of multi-agent systems (MAS), the properties of agents, the features of inter-agent communication, and the applicability of this approach to web scraping tasks. The relevance of the study stems from the rapid growth of data volumes on the Internet and from the limitations of traditional centralized web scraping systems, which struggle with scalability, blocking, and poor robustness to dynamic website changes. In this context, there is growing demand for decentralized architectures that adapt to changing environments and collect large quantities of information efficiently. One of the most promising approaches is the deployment of multi-agent systems, which enable distributed data collection, parallel processing, and resilient storage.

Purpose: to develop and structure an approach for using multi-agent systems in web scraping, and to describe a generalized algorithm that ensures scalable, fault-tolerant, and adaptive data collection.

Methods: the study employs theoretical analysis of multi-agent system properties, architectural models, and inter-agent communication mechanisms; an examination of existing practical implementations of distributed web crawling; and the synthesis of a generalized algorithm built on the identification of typical agent roles: scheduler, collector, parser, data processor, and protection bypass agent.

Results: the paper presents a three-tier architecture for the multi-agent system, comprising data collection, processing and coordination, and storage tiers. Key properties of agents are highlighted, showing how each contributes to the scraping task. The functions of five types of agents used in distributed web scraping are presented, together with a proposed scheme of their interaction. Based on the analysis of existing solutions, a generalized algorithm for distributed scraping has been formulated that reflects the interaction of these specialized agents. The algorithm comprises the following stages: initialization, task distribution, page loading, error handling in blocking scenarios, content parsing, and data storage. The findings indicate that the multi-agent approach provides parallelism, scalability, fault tolerance, and flexibility, adapting to diverse web resources and evolving challenges.

Practical significance: the results of this research can be used in the design of mass data collection systems, the construction of distributed web crawlers, and the creation of information analysis platforms based on multi-agent systems. The generalized algorithm can serve as the basis for flexible and scalable systems that operate effectively under large data volumes, frequent web page changes, and strong protective mechanisms.

Discussion: this article describes the integration of multi-agent system properties and principles into web scraping processes, resulting in a unified generalized model of agent interaction. The presented algorithm mirrors the practical structure of a distributed crawler and demonstrates how different types of agents can coordinate, collect, analyze, and filter data when interacting with dynamic and protected web resources. The importance of decentralization and adaptability for modern web scraping is emphasized, particularly in scenarios constrained by anti-bot protection.
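
For illustration only, the stages listed in the abstract (initialization, task distribution, page loading, error handling in blocking scenarios, content parsing, and data storage) can be sketched in Python as below. This is a minimal single-process sketch and not the algorithm from the paper itself: the in-memory FAKE_WEB pages, the BLOCKED_FOR_DEFAULT set, the Blocked exception, and all function names are hypothetical stand-ins for real network access and real agents, and the collectors run sequentially here, whereas a multi-agent deployment would run them in parallel.

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/a": "<html><title>Page A</title></html>",
    "http://example.test/b": "<html><title>Page B</title></html>",
}
# Assumption: this page rejects requests made with the default identity.
BLOCKED_FOR_DEFAULT = {"http://example.test/b"}


class Blocked(Exception):
    """Raised by the collector when the simulated site refuses the request."""


@dataclass
class Task:
    url: str
    attempts: int = 0
    identity: str = "default"  # stands in for proxy / user-agent settings


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def collector(task):
    """Collector agent: page loading (simulated)."""
    if task.url in BLOCKED_FOR_DEFAULT and task.identity == "default":
        raise Blocked(task.url)
    return FAKE_WEB[task.url]


def bypass_agent(task):
    """Protection bypass agent: switch identity before the task is retried."""
    task.attempts += 1
    task.identity = f"alt-{task.attempts}"
    return task


def parser(html):
    """Parser agent: extract raw fields from the page."""
    p = TitleParser()
    p.feed(html)
    return {"title": p.title}


def processor(url, record):
    """Data processor agent: normalize the parsed record."""
    return {"url": url, "title": record["title"].strip()}


def run(seed_urls, max_attempts=3):
    """Scheduler agent: initialization, task distribution, and storage."""
    queue = deque(Task(u) for u in seed_urls)  # initialization
    storage = {}
    while queue:
        task = queue.popleft()  # task distribution
        try:
            html = collector(task)
        except Blocked:
            # Error handling in blocking scenarios: retry with a new identity.
            if task.attempts < max_attempts:
                queue.append(bypass_agent(task))
            continue
        storage[task.url] = processor(task.url, parser(html))  # data storage
    return storage
```

Calling run(list(FAKE_WEB)) processes both pages; the blocked page is re-queued once with an alternative identity before it is stored. Keeping each role as a separate function mirrors the agent roles named in the abstract, so any one of them could be replaced by a networked agent without changing the scheduler loop.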

Keywords:
multi-agent systems, scraping, scaling, proactivity, autonomy
