Large Language Models and RAG

“Draw a distinction.”

George Spencer-Brown

(1923-2016)

This article is part of a series on generative artificial intelligence. The two previous articles showed how LLMs came about and how they can be used (Part 1) and what characteristics and weaknesses LLMs have, as well as possible approaches for improvement (Part 2).

The special feature of LLMs: they can be asked questions in natural language and also respond in natural language. The knowledge required for this was stored in the parameters of the LLM during extensive training on huge amounts of data (“parametric memory”). The sometimes inadequate quality of the answers, especially in specific domains, can be improved with “in-context learning” approaches using prompt engineering methods or the more complex fine-tuning of the LLM. This is one of the two dimensions of improvement described in the last article: changing the capabilities of the model (here the LLM) and/or the quality of the context provided:

(Image source: Bouchard, L.-F., Peters, L. (2024). Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG (English Edition). Kindle version, Towards AI).

In the last article, prompt engineering and fine-tuning were discussed in more detail. In addition to these more data-oriented change methods, there are also “infrastructural” developments.

The LLM models are also evolving. Two innovations should be mentioned as examples:

  • The downsizing of models (Small Language Models): in addition to the large models, e.g. from OpenAI, Meta, Google, etc., LLM families with small and medium-sized models are now being released. Smaller models (with parameter counts in the single-digit billion range) are cheaper and easier to train/fine-tune and can be operated as local models on end devices.
  • The extension to multimodal models, i.e. models that, in addition to text, can process sound, images and videos (simultaneously) and support them as input and output. The capabilities of LLMs are significantly improved by these multimodal extensions, but this cannot be presented here due to the scope of this article.

The other dimension of LLM improvement mentioned in the last article, contextual enhancement by means of Retrieval Augmented Generation (RAG), has not yet been dealt with in more detail. As described above and in the first articles, LLMs have a general, static “parametric knowledge”. In order to meet the more dynamic requirements of real applications (subject-specific, current, changing environments), the RAG process retrieves external knowledge (e.g. in the form of text documents) for a query on a topic and makes it available to the LLM in the prompt, enabling a context-relevant, more precise answer:

(Image source: Kimothi, A. (2024). A Simple Guide to Retrieval Augmented Generation (Version 3). Manning Publications (MEAP Edition).)

The RAG method is based on work carried out at Meta[1] in 2020/2021 (cf. Lewis et al., 2021). With this approach, LLMs (especially local ones) can be better integrated into everyday business life, as the quality of LLM answers in question/answer situations can be greatly increased by enabling access to current and subject-specific (internal) information. The approach can obviously be used in many situations. It is therefore understandable that RAG was one of the most intensively researched topics in the field of artificial intelligence (AI) in 2023 and that RAG implementations are among the most frequently encountered in the AI environment in 2024[2], e.g. in new Adobe, Microsoft or Google products. In the standard process, unstructured text data is usually used as external information.

What does the process look like that enables the LLM to provide the most accurate and relevant answer to a question? Essential for success is the provision of appropriate (external) documents, from which the most suitable information is retrieved in the RAG process and made available to the LLM.

The core element of the retrieval process is the use of “semantic search” to find the appropriate information. How can semantics be incorporated into automatic processes? How can the meaning of words, sentences or text passages be captured and made operationally manageable?

To do this, you need to know how an LLM processes language internally. Words are usually broken down into tokens (the billing unit by which OpenAI, for example, charges for the use of its API). To capture semantics, tokens are combined into so-called chunks, often with a fixed number of tokens per chunk. This breaks a text down into many chunks. Each chunk is then assigned a high-dimensional vector (of fixed dimension, often over 500, sometimes even over 1,000).
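As an illustration of the chunking step, here is a minimal Python sketch that cuts a text into overlapping chunks of a fixed number of tokens. A simple whitespace split stands in for a real tokenizer, and the chunk size and overlap are purely illustrative values:

```python
# Minimal fixed-size chunking sketch. A real pipeline would use the LLM's own
# tokenizer (e.g. BPE); whitespace splitting stands in for it here.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    tokens = text.split()                  # stand-in tokenization
    step = chunk_size - overlap            # overlapping windows preserve context across cuts
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```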

The position of the vectors reflects similarity in meaning: vectors that lie close together represent content that is more similar in meaning. The proximity between vectors is often measured, for example, by the angle between them.[3]
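To make the similarity measure from footnote [3] concrete, a small sketch that computes the cosine similarity of two vectors with NumPy; the toy three-dimensional vectors are of course only illustrative:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle = scalar product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([0.2, 0.7, 0.1])    # toy "embeddings"; real ones have 500+ dimensions
v2 = np.array([0.25, 0.6, 0.2])
print(cosine_similarity(v1, v2))  # values close to 1.0 indicate similar meaning
```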

How can this mapping between text and vectors be achieved? So-called embedding methods are used; they have been developed through training on large data sets and map text into high-dimensional vector spaces[4] so that it can then be treated mathematically, e.g. with neural networks. Technically, there is software for the tokenization, chunking and embedding steps. The vectors are managed in so-called vector databases.
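A highly simplified sketch of the embedding and storage step: the `embed` function here is only a placeholder for a trained embedding model (in practice an embedding model or API would be called), and the “vector database” is just an in-memory list; the example chunks are hypothetical:

```python
import numpy as np

EMBEDDING_DIM = 768  # a typical fixed embedding dimension

def embed(text: str) -> np.ndarray:
    """Placeholder for a trained embedding model. A pseudo-random vector derived
    from the text lets the sketch run without downloading a model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vector = rng.normal(size=EMBEDDING_DIM)
    return vector / np.linalg.norm(vector)

# In-memory stand-in for a vector database: each entry keeps the embedding
# vector together with the chunk it was computed from.
chunks = ["RAG retrieves external documents.", "LLMs store parametric knowledge."]
vector_store = [(embed(chunk), chunk) for chunk in chunks]
```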

After this preparation, the appropriate information must now be found. As a starting point, the vector for the question (“query”) posed to the LLM is considered, and the vectors of the external text that lie closest to it are searched for in order to determine text passages that are similar in meaning. The passages found are subjected to a ranking procedure, and the top hit(s) are passed to the prompt together with the query and any additional technical instructions (e.g. on the behavior of the LLM). The LLM then provides the answer to the question.
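Put together, the naive retrieval and prompt-assembly step might look like the following sketch, which reuses the `embed`, `cosine_similarity` and `vector_store` helpers from the sketches above; the prompt template and the final LLM call are placeholders, not a specific product API:

```python
def retrieve(query: str, vector_store, top_k: int = 3) -> list[str]:
    query_vector = embed(query)                          # embed the question itself
    scored = [(cosine_similarity(query_vector, vector), chunk)
              for vector, chunk in vector_store]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # simple ranking by similarity
    return [chunk for _, chunk in scored[:top_k]]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# The assembled prompt would then be passed to the LLM, e.g.:
# answer = llm.generate(build_prompt(question, retrieve(question, vector_store)))
```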

There are obviously many adjustable parameters (e.g. in all the methods listed above) that can be configured differently and therefore influence the quality of the answer. This is why the simple basic method described above is called “Naive RAG”, while the advanced variants based on numerous changes and additions to these methods are collectively referred to as “Advanced RAG”.[5]

Chunking is an example of the further development of methods in the context of “Advanced RAG”. Obviously, successful retrieval depends on the granularity and the appropriate cutting of the text into chunks and thus vectors. The aim of “semantic chunking” is to map the context of meaning more faithfully. Beyond chunking, a number of further approaches are used on the retrieval side: for example, rewriting the query, generating synthetic documents, and iterative and recursive retrieval methods. This helps to find the relevant passages (“recall”). In prompting, the use of Chain of Thought (CoT) prompts or similar methods (cf. Article 2) ensures, through a sequence of linguistically formulated intermediate steps, that the LLM delivers more precise results (“precision”)[6], sometimes even automatically (DSPy – cf. Article 2).
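One possible reading of “semantic chunking” as a sketch: instead of cutting after a fixed number of tokens, the text is split at sentence boundaries and a new chunk is started whenever the meaning of consecutive sentences diverges. It again reuses the `embed` and `cosine_similarity` placeholders from above; the threshold and the sentence-splitting rule are illustrative assumptions:

```python
import re

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    """Start a new chunk whenever the embedding similarity between two
    consecutive sentences drops below the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], [sentences[0]]
    for previous, sentence in zip(sentences, sentences[1:]):
        if cosine_similarity(embed(previous), embed(sentence)) < threshold:
            chunks.append(" ".join(current))   # meaning diverges: close the chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```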

Another means of better penetrating the content of the external knowledge provided are so-called “knowledge graphs”, which are managed in graph databases and represent structured data.[7]

A knowledge graph is a data structure that decomposes data into semantic units: concepts and entities are represented as nodes, and these nodes are linked to one another through relationships. For example, nodes represent entities such as people, companies, articles and tasks, while edges denote the relationships between them, such as employment, reporting structures, mentions in articles and task assignments; both can store structured as well as unstructured information. Through a deeper and broader understanding of context, questions that span multiple documents or topics can also be answered, such as “What are the main topics in the dataset?”. In addition, this kind of knowledge processing can reduce hallucinations and increase accuracy. These capabilities make knowledge graphs ideal for RAG applications, as they enable efficient storage, retrieval and contextualization of complex and interconnected data (Bülow, 2024).
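A minimal illustration of the idea, with purely hypothetical entities and relationships represented as simple triples; a real graph database would of course offer far richer modeling and query languages:

```python
# A knowledge graph reduced to its essence: (subject, relationship, object) triples.
# The entities and relationships below are purely hypothetical.
triples = [
    ("Alice", "WORKS_AT", "Acme Corp"),
    ("Alice", "REPORTS_TO", "Bob"),
    ("Article 17", "MENTIONS", "Acme Corp"),
    ("Bob", "ASSIGNED_TO", "Task 42"),
]

def neighbors(entity: str) -> list[tuple[str, str, str]]:
    """All relationships in which an entity appears - the basis for answering
    questions that hop across several documents or topics."""
    return [t for t in triples if entity in (t[0], t[2])]

print(neighbors("Acme Corp"))  # links Alice's employment with the article mention
```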

Overall, “Advanced RAG” offers a number of new and more efficient methods that make retrieval more dynamic and more accurate and lead to more relevant answers. Iterative and adaptive methods are also supported, which allow changing application environments to be taken into account more flexibly.

For a more detailed description of further developments in the context of “Advanced RAG”, see e.g. Xalli AI Lab (2024); Huang, Y., Huang, J. X. (2024); Singh, V. (2024).

The above objectives are pursued even more consistently by the “Advanced RAG” and especially the “Modular RAG” approach. Dedicated components are built and used for optimal chunking, semantic search, appropriate ranking of search results, and so on. The modules increase the flexibility and maintainability of the RAG system. However, these environments are more complex to manage, especially as iterative and adaptive links between the modules increase precision and broaden the range of applications.
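How this modular idea could look in code, as a rough sketch with hypothetical interfaces for the retrieval, ranking and generation components (the interface names, parameters and prompt template are assumptions, not a specific framework):

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> list[str]: ...

class Reranker(Protocol):
    def rerank(self, query: str, chunks: list[str]) -> list[str]: ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

def rag_pipeline(query: str, retriever: Retriever, reranker: Reranker,
                 generator: Generator) -> str:
    # Each stage can be swapped independently (different chunking/search,
    # different ranking, different LLM) without touching the rest.
    candidates = retriever.retrieve(query, top_k=10)
    context = reranker.rerank(query, candidates)[:3]
    prompt = "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return generator.generate(prompt)
```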

In addition to these challenges on the system architecture and operations side, measuring how well these systems perform is naturally of interest; evaluation software and its metrics can also be used for iterative optimization of running RAG systems. RAG systems should ensure that LLM answers are relevantly anchored in the given context and that hallucinations are reduced or avoided, which is also the purpose of the many suggested improvements to pre- and post-retrieval in the course of Advanced RAG. To check the extent to which these expectations are met, there are evaluation approaches for the overall system and, in particular, for the two main subcomponents of a RAG system: retrieval and generation.

When evaluating RAG systems, a distinction is usually made between measuring the quality of the retriever and that of the generator. Typical methods are BLEU or ROUGE, for example. In addition, there are LLM-based frameworks that also generate synthetic data sets for verification and benchmarking, another important function of evaluation methods, e.g. RAGAs and ARES. A very new framework, RAGChecker (see Ru, D., Qiu, L., Hu, X. et al., 2024), will be presented here. This evaluation software is published as open source and, as described above, looks at the overall system as well as the two subsystems. So-called claims are extracted from the texts, and the numbers of correct and incorrect claims in the model response and in the chunks selected during retrieval are considered. The metrics “precision” and “recall”, already known from machine learning, are used for this[8]: precision expresses the ratio of correct claims to the total number of claims made in the model response, and recall expresses the proportion of the ground-truth claims that are covered by the model response (see following graphic):

(Image source: Ru, D., Qiu, L., Hu, X. et al. (2024). RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. arXiv:2408.08067v2, August 17, 2024)
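To illustrate the two metrics at claim level, a small sketch that computes precision and recall from sets of claims; the claim extraction itself (performed by an LLM in RAGChecker) is assumed to have already taken place, and the claim identifiers are hypothetical:

```python
def claim_metrics(response_claims: set[str], ground_truth_claims: set[str]) -> tuple[float, float]:
    correct = response_claims & ground_truth_claims  # claims in the answer that are actually correct
    precision = len(correct) / len(response_claims) if response_claims else 0.0
    recall = len(correct) / len(ground_truth_claims) if ground_truth_claims else 0.0
    return precision, recall

# Toy example: 2 of 3 claims in the response are correct (precision ≈ 0.67),
# and 2 of 4 ground-truth claims are covered by the response (recall = 0.5).
print(claim_metrics({"c1", "c2", "c3"}, {"c1", "c2", "c4", "c5"}))
```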

Overall, the development of LLM usage increasingly points to the embedding of one or more LLMs in RAG systems, which must then also be able to process heterogeneous data from different sources. As the environments become more complex in the course of comprehensive automation, and as increasing quality requirements lead to more iterative and adaptive processes, system design and operation come to the fore. This applies all the more as components are increasingly being developed as autonomously acting agents.

The next article will deal with LLM-based agents and applications of RAG systems.


[1] Facebook AI Research (FAIR)

[2] Typical applications are “Ask your document”, service desk support or chatbots.

[3] The cosine of the angle between the two vectors corresponds to the scalar product of the vectors, divided by the product of the lengths of the vectors. This allows you to work with the scalar product, which is easy to calculate.

[4] Ultimately, the more than sixty-year-old “Distributional Hypothesis” serves as the basic idea: “A word is characterized by the company it keeps” (cf. Firth, J.R., 1957).

[5] For the meaning and classification of these terms in an overall view of the status and planned developments, see Gao, Y. et al. (2024).

[6] See e.g. Huang, Y., Huang, J. X. (2024); Singh, V. (2024) on the methods mentioned in this paragraph.

[7] The conversion of unstructured data (text documents) into structured data is often used in practice as information extraction in LLM and RAG projects.

[8] See also above in connection with the improvement of Naive RAG as part of Advanced RAG.

Bibliography

Bülow, J. (2024). Knowledge Graph RAG: The future of AI-supported knowledge utilization. https://de.linkedin.com/pulse/knowledge-graph-rag-die-zukunft-der-ki-gest%C3%BCtzten-j%C3%B6rn-b%C3%BClow-4gsde (retrieved on 05.09.2024)

Firth, J.R. (1957). “A synopsis of linguistic theory 1930-1955”. Studies in Linguistic Analysis: 1-32. Reprinted in F.R. Palmer, ed. (1968). Selected Papers of J.R. Firth 1952-1959. London: Longman

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J. & Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997v5

Gao, Y., Xiong, Y., Wang, M., Wang, H. (2024). Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks. arXiv:2407.21059v1

Greyling, C. (2024). RAG & LLM Context Size. Medium. https://cobusgreyling.medium.com/rag-llm-context-size-6728a2f44beb (retrieved on 05.09.2024)

Huang, Y., Huang, J. X. (2024). The Survey of Retrieval-Augmented Text Generation in Large Language Models. arXiv:2404.10981v2

June, F. (2024). Advanced RAG 01: Problems of Naive RAG. Medium. https://ai.plainenglish.io/advanced-rag-part-01-problems-of-naive-rag-7e5f8ebb68d5 (retrieved on 06.09.2024)

Kimothi, A. (2024). A Simple Guide to Retrieval Augmented Generation (Version 3). Manning Publications (MEAP Edition).

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401v4

Microsoft (2024). GraphRAG. GitHub, GitHub repository. https://github.com/microsoft/graphrag

Ru, D., Qiu, L., Hu, X. et al. (2024). RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. arXiv:2408.08067v2

Singh, V. (2024). Building LLM Applications: Advanced RAG (Part 10). Medium. https://medium.com/@vipra_singh/building-llm-applications-advanced-rag-part-10-ec0fe735aeb1 (retrieved on 06.09.2024)

Xalli AI Lab (2024). LLM: Building Advanced RAG (Part 1). Medium. https://xallyai.medium.com/llm-building-advanced-rag-part-1-14e9f7a8f063 (retrieved on 27.09.2024)

Wilhelm Niehoff