Properties of LLMs: weak points and improvement measures for the domain adaptation of applications

“Increasing complexity has obviously led to what is not uncommon in complex systems: quantitative increase leads to new qualities”[1]

Wolf Singer (*1943) (translated from German)


The first article in this series on Large Language Models (LLMs) showed how communication with computers in natural language(s) has become possible over the last few decades. A key step was the abandonment of the belief that this complex task could be accomplished by means of rules created (and implemented) directly by humans. Instead, such rules are acquired implicitly by training neural networks on extremely large amounts of data using powerful computers. Complex rules, and the processed data with their complex dependencies, are thus implicitly encoded in the resulting language model, which predetermines some of its basic properties.

This generative system, developed as a Foundation Model [2], is capable, thanks to its “broad training”, of forming many associations with incoming stimuli and thus of creating a context in which texts are produced. From this development history, foundation models derive two basic properties: emergence and homogeneity, i.e. broad applicability to questions that have not been explicitly learned, and usability (after adaptation) for many tasks that would otherwise require individual solutions (cf. Bommasani et al. (2021)).

This blog post focuses on ways of addressing and adapting LLMs for downstream tasks. The aim is to increase efficiency and to mitigate weaknesses that likewise stem from the development history, such as hallucinations, a lack of topicality, or “problematic” views absorbed from the training data. These and other weak points are listed and examined in more detail, for example, in a study that builds a taxonomy for the trustworthiness of LLMs, with the goal of working systematically on eliminating them (cf. Liu, Y. et al. (2024)); see the following figure.

(Image source: Liu, Y., Yao, Y. et al. (2024). Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv:2308.05374v2)

The following three method areas for adapting LLMs to application domains are presented: “In-Context Learning (ICL)”, “Prompt Engineering” and “Fine-Tuning”.

The “learning” in the term “In-Context Learning (ICL)” is tricky in that it refers to the activation and generation of implicit information that is not directly tangible, not to a change of neuronal parameters in the model, as is usually the case with learning (cf. Dong, Q. et al. (2024)). Nevertheless, ICL generates information that was never explicitly learned in this form (emergence), much as human learning succeeds in synthesizing new information from learned categories, so that six-year-olds are already familiar in principle with almost all object categories of the world [3] (cf. Biederman, I. (1987)).

Using the categories learned in the LLM for the joint representation of data, rules and links [4], the “appropriate” answer is generated by prompting the LLM – e.g. with a question alone (“zero-shot”), supplemented by one example (“one-shot”) or by several examples (“few-shot”). These input instructions to the LLM are called “prompts”, and the discipline of creating them optimally, systematically and effectively is called prompt engineering (cf. e.g. Sahoo, P. et al. (2024)).
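As a simple illustration, the following sketch shows a few-shot prompt for sentiment classification; the task and the reviews are invented examples, and the string can be sent to any completion or chat endpoint:

```python
# A minimal few-shot prompt: the model infers the task from examples in
# the context window alone, with no change to any model parameter.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is brilliant.
Sentiment: positive

Review: Stopped working after two weeks, support never answered.
Sentiment: negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

# The LLM is expected to continue with "positive". Without the two
# solved examples this would be a zero-shot prompt; with exactly one,
# a one-shot prompt.
```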

To ensure sufficiently good quality of LLM responses, the user must also take into account the specific LLM, its training history and special aspects of the application domain when creating the prompt. To avoid becoming dependent on the skill of individual prompt engineers, and to turn these craft-like processes into automated ones, more structure and technological functions must be added to the LLM environment – especially for complex tasks. One approach is to add structure at the prompt level, for example by breaking the prompt down into the parts “Instruction, Context, Input Data and Output Indicator (format)”, optimizing each of them, and then maintaining a “library” of such prompts (a sketch of such a template follows the figure below). A current systematic overview, organized by application area and based on recent work (mostly from 2023), can be found in Sahoo, P., Singh, A.K. et al. (2024).

The following taxonomy also describes a standardized approach to operationalizing prompting depending on the complexity and level of detail of the requirements (Santu, S.K.K., Feng, D. (2023)). The taxonomy shows important developments, e.g. how the LLM weaknesses of lacking topicality and detailed expertise can be overcome for highly detailed questions by introducing external knowledge through the RAG (Retrieval-Augmented Generation) process (see Levels 5 and 6 in the figure). For this purpose, the technological environment must be supplemented, e.g. with vector databases and new processes (retrieval, similarity search, etc.). The RAG process, which is important for the further development of LLMs towards use in productive environments, is described in detail in article 3 of this blog series.

(Image source: Santu, S.K.K., Feng, D. (2023). TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks. arXiv:2305.11430v2)
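To make the decomposition mentioned above concrete, here is a minimal sketch of a structured prompt template; the helper function, the section headings and the field contents are purely illustrative, not a fixed standard:

```python
# A prompt broken down into the four parts named above, so that each part
# can be optimized, versioned and maintained separately in a prompt library.
def build_prompt(instruction: str, context: str, input_data: str,
                 output_indicator: str) -> str:
    return (
        f"### Instruction\n{instruction}\n\n"
        f"### Context\n{context}\n\n"
        f"### Input Data\n{input_data}\n\n"
        f"### Output Format\n{output_indicator}"
    )

prompt = build_prompt(
    instruction="Summarize the customer complaint in one sentence.",
    context="The customer is a long-standing business client.",
    input_data="Complaint: The delivery arrived late and incomplete.",
    output_indicator="One English sentence, no bullet points.",
)
```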

Other important results from the work on special prompts, which cannot be explained here, are role definitions and output formats, as well as approaches for self-checking LLMs. Reasoning formats that let the LLM work through a task systematically and comprehensibly (Chain of Thought (CoT), Tree of Thoughts (ToT) and, more generally, Chain of X) are shown below as examples of the development lines of the prompt system. Prompts of this type break the solution of a complex problem down into a series of intermediate steps and represent them using the LLM (see the illustration and the sketch below).

(Image source: Yao, S. et al (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601v2)
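As a simple illustration of the CoT idea, the following sketch contrasts a direct prompt with a zero-shot chain-of-thought prompt (using the widely known “Let's think step by step” trigger); the question is an invented example:

```python
# Zero-shot CoT: the trigger phrase makes the model emit intermediate
# reasoning steps before the final answer, instead of answering directly.
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, then state the final answer "
    "on a separate line starting with 'Answer:'."
)
# The CoT variant typically yields visible intermediate steps
# (45 min = 0.75 h; 60 / 0.75 = 80 km/h) before "Answer: 80 km/h".
```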

Prompts have diffused into a large number of variants (see figure below), also referred to as “Chain of X”, which can be operationalized with “creation principles”.

(Image source: Greyling, C. (2023). The Anatomy Of Chain-Of-Thought Prompting (CoT). Medium. https://cobusgreyling.medium.com/the-anatomy-of-chain-of-thought-prompting-cot-b7489c925402 (retrieved 10.07.2024))

Despite all systematization, the fragility caused by the dependence on environmental components can hardly be resolved without a growing number of special cases. It is also problematic that specialists specify how the model should work, even if one wants to operationalize the construction of prompts with the help of creation principles and make it less arbitrary (cf. e.g. Greyling, C. (2023)). It is clear that approaches that tell the LLM “how” to operate tend to move towards “unambiguous” formal languages in order to achieve the goals of objectivity, comprehensibility and unambiguity. In short, there is sometimes a tendency to want to recreate programming languages. However, an LLM is not the appropriate counterpart for this, and such attempts blatantly miss the point of creating NLP-capable models. Natural language does not fulfill the stated goals by itself, but only through a different paradigmatic approach, which we will develop over the course of this blog series and whose effects we will examine.

In this context, two recent developments for the automated production of optimal prompts, both created at Stanford University and currently receiving a great deal of attention, should be mentioned: DSPy (Declarative Self-Improving Language Programs (in Python)) (cf. Khattab, O. et al. (2023)) and TextGrad (cf. Yuksekgonul, M. et al. (2024)).

DSPy is a declarative, self-improving framework for the algorithmic optimization of LM prompts and LM parameters. It does not directly specify “how” the LLM should process a task; instead, so-called “signatures” specify the “what”: the input and the desired output. The “how” is generated by the optimization procedures of the DSPy system. Unlike in the approaches listed so far, model parameters can also be changed. This is a genuinely new approach whose scope is not immediately obvious due to its complexity (cf. Hebbar, S. (2024); Khattab, O. (2024)); a sketch of a signature follows the figure below.

(Image source: Hebbar, S. (2024). DSPy: The Future of Programming Language Models. https://www.linkedin.com/pulse/dspy-future-programming-language-models-srinivas-hebbar-w1jpf)
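The following sketch, based on the DSPy documentation, shows what a signature and a module built on it can look like; the model name is an example, and API details may differ between DSPy versions:

```python
import dspy

# Configure the language model (the model name is an example).
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# A signature declares only the "what": inputs and the desired output.
class AnswerWithContext(dspy.Signature):
    """Answer the question using the given context."""
    context = dspy.InputField(desc="relevant background text")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short, factual answer")

# DSPy generates the "how": the concrete prompt, here with CoT reasoning,
# which its optimizers can then improve against a metric.
qa = dspy.ChainOfThought(AnswerWithContext)
result = qa(context="PEFT freezes most parameters during fine-tuning.",
            question="What does PEFT freeze?")
print(result.answer)
```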

This applies in particular to TextGrad, also from the Stanford University environment. Here, too, the aim is to optimize the whole system of components around an LLM. The interesting approach of “differentiation by text” feeds textual feedback from LLMs back into the individual components of a compound AI system in order to improve them. Analogous to the gradient method in backpropagation, it computes optima for quite general application situations, e.g. for determining optimal molecule constellations, program code or radiation-therapy treatment plans.
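The following sketch follows the pattern of the TextGrad documentation; the engine name is an example, and the API may differ between versions:

```python
import textgrad as tg

# The "backward engine" is the LLM that produces the textual gradients.
tg.set_backward_engine("gpt-4o", override=True)

# A Variable holds text; requires_grad marks it as optimizable.
answer = tg.Variable(
    "A first draft answer to be improved ...",
    role_description="concise answer to a domain question",
    requires_grad=True,
)

# The "loss" is itself natural language: an evaluation instruction.
loss_fn = tg.TextLoss("Evaluate the answer critically; give concise feedback.")
optimizer = tg.TGD(parameters=[answer])

loss = loss_fn(answer)   # LLM-generated critique of the answer
loss.backward()          # critique becomes a textual "gradient"
optimizer.step()         # the answer text is rewritten using it
print(answer.value)
```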

Even if transparency and control fall somewhat short of the direct prompting procedures, a more user- and environment-neutral approach can be achieved if – as with rule generation in NLP and NLU – successful results are obtained by giving up control and transparency in the creation of the appropriate “rules”.[5][6] Possibly a principle?! At the very least a different culture in the creation and use of software, as a result of the new AI paradigm of foundation models.

The methods discussed so far are used in particular in LLM applications with frequently changing data and domain-driven questions. In situations where more stable skills are required – such as technical language and vocabulary, communication styles, or capabilities such as summarization, topic modeling or programming – extensive training or fine-tuning is the right approach, as already described at the end of the first part for the transition from the pre-trained model GPT to InstructGPT, GPT-3.5 and ChatGPT. The reinforcement learning approach mentioned there will not be discussed further.

Standard fine-tuning affects in principle all parameters of the LLM and is therefore resource-intensive and expensive. If all parameters are left free, there is also the risk of “catastrophic forgetting” through changes to parameters that are important for already established (“learned”) capabilities. It is therefore worthwhile to use so-called PEFT (Parameter-Efficient Fine-Tuning) techniques, which build on the transfer-learning capabilities of LLMs: they freeze a large proportion of the parameters and modify only a much smaller number. This also protects against the risk of catastrophic forgetting.

There are different methodological variants (see Abdullahi, M. (2024)):

  • a special selection of changeable vs. frozen parameters (“selective PEFT”),
  • freezing everything and adding one or more additional layers, e.g. via adapters or soft prompts (“additive PEFT”),
  • reparametrization (“LoRA”) using so-called low-rank representations: the weight update is factored into two matrices of much lower rank whose product restores the original dimensions (cf. Hu, E.J. et al. (2021)). Training only these smaller matrices can reduce the number of free parameters by over 80% (see the sketch after this list).
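The following sketch shows the LoRA variant with Hugging Face's peft library; the base model and the target modules are examples:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# The base model is an example; any causal LM works analogously.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA: freeze the base weights and train, per target layer, two
# low-rank matrices A (r x k) and B (d x r) whose product B @ A has
# the shape of the original d x k weight matrix.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                         # rank of the decomposition
    lora_alpha=16,               # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projection in GPT-2
)
peft_model = get_peft_model(model, config)

# Typically well under 1% of the parameters remain trainable.
peft_model.print_trainable_parameters()
```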

A quantized variant of LoRA (QLoRA) is used to reduce memory requirements – especially GPU RAM – and to speed up processing (see Dettmers, T. et al. (2023)). QLoRA stores the parameters in 4-bit representations instead of the 16-bit computational precision, without major losses; for this it introduces a new data type, the 4-bit NormalFloat (NF4). There are a number of further approaches that cannot be discussed in this blog post.
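The following sketch shows a QLoRA-style setup with 4-bit NF4 quantization via the bitsandbytes and transformers libraries; the model name is an example, and a GPU environment is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization as proposed in QLoRA (Dettmers et al., 2023).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # example model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters are then trained on top of the frozen 4-bit base model.
peft_model = get_peft_model(
    model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)
)
```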

As a result, the trade-offs between the usual requirements (parameter efficiency, training speed, inference costs, model performance and memory efficiency) [7] can be resolved in the PEFT area.

In summary, it is clear that many innovative and sustainable measures to overcome the weaknesses of LLMs already exist and are constantly being developed further. In general, all of the approaches (top right in the image below) must be combined to arrive at an optimal LLM design.

(Image source: Bouchard, L.-F., Peters, L. (2024). Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG (English Edition). Kindle version, Towards AI).

The addition of up-to-date and specialist data sources to the LLM (RAG) for context optimization, the optimization of communication with the LLM, and the introduction of communication between modules already point to the future target structure: a system of interacting modules, tools and applications with access to different “knowledge bases” containing structured and unstructured data of different modalities, which optimizes itself and adapts to new requirements. In the next blog post, we will begin presenting these new approaches with context optimization (RAG) and further LLM optimization through the integration of additional components and functions.


Bibliography

Abdullahi, M. (2024). An Introduction to LoRa: Unpacking the Theory and Practical Implementation. Medium. https://medium.com/@mujahidabdullahi1992/an-introduction-to-lora-unpacking-the-theory-and-practical-implementation-e665c5d78295 (retrieved 10.07.2024)

Biederman, I. (1987). “Recognition-by-Components: A Theory of Human Image Understanding”. Psychological Review. 94 (2): 115–147

Bommasani, R., et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv:2108.07258v3

Bouchard, L.-F., Peters, L. (2024). Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG (English Edition). Kindle version, Towards AI.

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314v1

Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., Sun, X., Li, L. & Sui, Z. (2024). A Survey on In-context Learning. arXiv:2301.00234v4

Greyling, C. (2023).  The Anatomy Of Chain-Of-Thought Prompting (CoT). Medium. https://cobusgreyling.medium.com/the-anatomy-of-chain-of-thought-prompting-cot-b7489c925402 (retrieved 10.07.2024)

Hebbar, S. (2024). DSPy: The Future of Programming Language Models. https://www.linkedin.com/pulse/dspy-future-programming-language-models-srinivas-hebbar-w1jpf

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685v2

Khattab, O. (2024). DSPy: Programming-not prompting-Foundation Models. GitHub, GitHub repository. https://github.com/stanfordnlp/dspy

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., & Potts, C. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714

Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Cheng, R.G.H., Klochkov, Y., Taufiq, M.F., & Li, H. (2024). Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv:2308.05374v2

Sahoo, P., Singh, A.K., Saha, S., Jain, V., Mondal, S., & Chadha, A. (2024). A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv:2402.07927v1

Santu, S.K.K., Feng, D. (2023). TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks. arXiv:2305.11430v2

Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., Wu, X., Liu, Y., & Xiong, D. (2023). Large Language Model Alignment: A Survey. arXiv:2309.15025v1

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601v2

Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., & Zou, J. (2024). TextGrad: Automatic “Differentiation” via Text. arXiv:2406.07496v1


[1] https://www.spiegel.de/politik/das-falsche-rot-der-rose-a-817b363e-0002-0001-0000-000018166285

[2] Foundation models are understood as a general paradigm of AI, not specific to NLP, but they are still most commonly used there today (see Bommasani et al. (2021)).

[3] https://en.wikipedia.org/wiki/One-shot_learning_(computer_vision)

[4] These categories are the representations generated during deep learning (DL), which in abstract form provide the basis for the synthetic generation of language.

[5] The studies on the structure of optimal prompts and the classification and evaluation procedures for prompts are nevertheless very useful and welcome (especially from a scientific point of view). However, industrial use (as described) imposes different requirements.

[6] The normal RAG approach described in the next blog post also generates a large part of the prompt for the LLM “automatically” in its process, through an optimized semantic search in external texts.

[7] https://www.coursera.org/learn/generative-ai-with-LLms

Wilhelm Niehoff