Implementation of RAG Architecture for Automated Document Verification in Corporate Storage Systems Using Large Language Models

Maksim A. Kostin; Dayana M. Davydova; Vladimir E. Petrov

doi:10.20295/2413-2527-2026-246-5-16

Journal: INTELLECTUAL TECHNOLOGIES ON TRANSPORT, No. 2, 2026

Rubrics: INFORMATION SECURITY AND DATA PROTECTION

Maksim A. Kostin ¹

Dayana M. Davydova ²

Vladimir E. Petrov ³

Author and publication information

Authors:

1. Emperor Alexander I St. Petersburg State Transport University (Information and Computing Systems Department), student, Russian Federation

2. Emperor Alexander I St. Petersburg State Transport University (Information and Computing Systems Department), Senior Lecturer, Russian Federation

3. Emperor Alexander I St. Petersburg State Transport University (Information and Computing Systems Department), Associate Professor, Russian Federation

Type:

Article

DOI:

https://doi.org/10.20295/2413-2527-2026-246-5-16

EDN:

https://elibrary.ru/xcuvxi

Pages:

from 5 to 16

Status:

Published

Received:

26.03.2026

Accepted:

25.05.2026

Published:

24.06.2026

Subject area:

VAK Russia 2.3.6
VAK Russia 1.2.1
UDC 004.041
UDC 004.056

Language:

Russian

Keywords:

large language models, automated document verification, vector search, natural language processing, corporate systems, Ollama, text extraction, semantic analysis

Abstract and keywords

Abstract:
This paper presents a specialized solution for analyzing documents in corporate document management systems. Purpose: to develop and implement a RAG architecture for automated verification of corporate documents using locally deployed large language models, identifying missing required fields, data format errors, and substantive contradictions in documents of transport and logistics systems. Methods: a Java application was designed and implemented that integrates a module for extracting text from PDF and DOCX documents based on Apache libraries, a vector store with simplified embeddings based on word frequency analysis, a semantic search algorithm based on cosine similarity, and a client that interacts with the LLM through the Ollama server API. Results: the developed system demonstrated the ability to analyze document content in context and to adapt to varying formats of information presentation, which makes it possible to overcome the limitations of traditional systems. Experimental verification was performed on a test set of corporate documents with intentionally introduced errors of various types; effectiveness was assessed by the metrics of completeness, accuracy, and their average, as well as by system response time. The llama3.2:latest model showed the best results, with no confidential data transferred outside the organization’s infrastructure. Practical significance: the proposed solution can be applied to automate documentation quality control in corporate document management systems of transport enterprises, government agencies, and industrial organizations. The modular architecture allows the system to be extended to other document types and integrated with existing information systems at minimal adaptation cost. Using open models and a local Ollama server reduces dependence on third-party cloud services and ensures compliance with information security requirements.
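
The retrieval step named in the Methods (simplified word-frequency embeddings compared by cosine similarity) can be illustrated with a minimal Java sketch. This is not the authors' code: the class name `TermFrequencySearch`, the tokenization rule, the normalization by document length, and the `topK` helper are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of frequency-based embeddings; not the published implementation. */
final class TermFrequencySearch {

    private TermFrequencySearch() {}

    /** Builds a sparse term-frequency vector: token -> relative frequency in the text. */
    static Map<String, Double> embed(String text) {
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        // Assumed tokenization: lowercase, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1.0, Double::sum);
            total++;
        }
        if (total > 0) {
            final double n = total;
            counts.replaceAll((term, count) -> count / n);
        }
        return counts;
    }

    /** Cosine similarity of two sparse vectors; returns 0.0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices of the k chunks most similar to the query, best first. */
    static List<Integer> topK(String query, List<String> chunks, int k) {
        Map<String, Double> q = embed(query);
        double[] scores = new double[chunks.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            scores[i] = cosine(q, embed(chunks.get(i)));
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return order.subList(0, Math.min(k, order.size()));
    }
}
```

In a pipeline of this kind, the chunks returned by `topK` would typically be placed into the prompt sent to the locally hosted model. Because the vectors only count words, two passages that express the same idea in different vocabulary score zero, which is the usual trade-off of such simplified embeddings against learned ones.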
