Large Language Models: Origin-Use-Further Development

…if there were a parrot that had an answer for everything, I would declare without hesitation that it is a thinking being…

Denis Diderot

(1713 – 1784)

Over the last five years, large language models (LLMs) have brought a leap forward in natural language processing (NLP), both in understanding (natural language understanding, NLU) and in generation (natural language generation, NLG), and thus in communication between humans and computers. With ChatGPT, the general public has become aware of this as well, and possible uses in companies are becoming increasingly relevant. A short series of articles will describe the emergence of LLMs, the options for integrating them into processes, and their effects:

  1. History of LLMs
  2. Properties of LLMs, foundation models and application options
  3. Integration of optimized LLMs in applications
  4. Status and development prospects of the use of LLM-based solutions
  5. Effects and options for action for the financial sector

This article outlines the history of LLM development up to the current status. On this basis, the next article will present the strengths, weaknesses and variants of LLMs.

In November 2022, OpenAI announced ChatGPT via Twitter: a simple web application, presented as a small update to existing models, with the request to test it and provide feedback for improvement.[1] Shortly afterwards one million, and after two months one hundred million, users were surprised, excited and unsettled to find that they could talk to software, even in a reasonably sophisticated way, as if they were talking to a person. This was made possible by software from the Generative Pre-trained Transformer (GPT-n) series, at the heart of which is a multi-layered neural network with 175 billion parameters[2]. It was developed through extensive training on a huge amount of data, using a farm of special hardware.

This expenditure of resources and the ambivalent attitude towards artificial intelligence led not only to great publicity but also to unusual contributions, such as the comparison with the Manhattan Project[3] and the call for a moratorium[4] by well-known players in this field, perhaps partly motivated by competition. An enormous surge in research and innovation has since meant that new publications in the field of AI are often outdated or in need of revision after just a few weeks, and this continues to this day.

The comparison with the Manhattan Project was certainly a little far-fetched, but the main visions did emerge in the same era. With the construction of the first digital computers came the vision of natural language communication between humans and computers. Natural language processing (NLP) was advanced in the 1950s by a task motivated by the Cold War: creating an automatic translation program, especially from Russian into English. However, the computational linguists’ approach underestimated how complex it is to use manually written rules to place the meaning of words into context models in such a way that language comprehension can be at least adequately simulated. This symbolic NLP approach failed, although it did contribute to the later development of software libraries with direct solutions for NLP tasks.[5]

The computer evidently had to generate the rules for language capability itself. This became feasible with the computing capacity now available, in particular GPUs[6], and with the vast amount of data now available in the digital world. The method used to achieve this goal is self-supervised[7] machine learning with multi-layered neural networks.

The basic idea of moving from syntax to semantics is to capture “the meaning of words” and make it operable by considering the set of contexts in which these words occur.[8]
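The idea of describing a word by its contexts can be illustrated with a minimal sketch. The tiny corpus, the window size and the function name `context_counts` are assumptions made purely for illustration; real language models learn dense vectors instead of counting neighbours.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration.
corpus = [
    "the bank raised the interest rate",
    "the bank lowered the interest rate",
    "the dog chased the cat",
    "the cat chased the mouse",
]

def context_counts(sentences, window=2):
    """Count, for every word, the words that occur within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][words[j]] += 1
    return contexts

contexts = context_counts(corpus)
print(contexts["raised"])
print(contexts["lowered"])
print(contexts["chased"])
```

In this toy corpus, "raised" and "lowered" appear in identical contexts and therefore receive identical context counts, while "chased" keeps different company. This is the sense in which context can stand in for meaning.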

The learning objective for the neural network is to predict the next suitable word for a given word sequence.[9] To do this, the entire sequence of words must be taken into account in the learning cycles in order to do justice to the above basic idea of capturing the semantics.
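How the learning objective obtains its training examples from plain text can be sketched as follows. The function name `next_word_examples` and the example sentence are hypothetical, and splitting on spaces is a simplification; actual models work on sub-word tokens.

```python
def next_word_examples(text):
    """Turn raw text into (context, next word) pairs; the labels come from the text itself."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

examples = next_word_examples("the bank raised the interest rate")
for context, target in examples:
    print(" ".join(context), "->", target)
```

Each pair uses the whole preceding sequence as context, and no human annotation is needed to produce the target word.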

Methodologically, this is achieved through the attention concept (see Bahdanau et al., 2015)[10]. The transformer architecture implements it in a way that also allows the calculations to be parallelized. The paper that introduced it, “Attention is all you need”[11] by A. Vaswani, N. Shazeer, N. Parmar et al., led to a breakthrough in 2017.
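The core calculation of the attention concept, scaled dot-product attention, can be sketched in a few lines of plain Python. This is a simplified illustration, not the full transformer: the learned projection matrices, multiple heads and masking are omitted, and the three token vectors are invented.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equally sized vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three token vectors, invented for illustration; self-attention uses them as Q, K and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Every output vector is a weighted mixture of all input vectors, so each position can take the entire sequence into account. Because the rows are computed independently of one another, the calculation can be parallelized.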

In very simplified terms, the transformer consists of an encoder part and a decoder part, which ensure that, during the learning cycles, the words of a sequence are assigned to vectors in a high-dimensional vector space (over 700, usually over 1000 dimensions)[12] in which words similar in meaning have vectors that are close to each other.
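What "close to each other" means for word vectors can be made concrete with cosine similarity, one common measure of closeness. The three-dimensional vectors below are hand-made, hypothetical values; learned embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors for illustration only (not taken from any real model).
embeddings = {
    "loan":   [0.9, 0.1, 0.0],
    "credit": [0.8, 0.2, 0.1],
    "parrot": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(embeddings["loan"], embeddings["credit"])
different = cosine_similarity(embeddings["loan"], embeddings["parrot"])
print(f"loan vs credit: {similar:.2f}")
print(f"loan vs parrot: {different:.2f}")
```

With these values, "loan" and "credit" point in almost the same direction, while "loan" and "parrot" are nearly orthogonal.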

In this way, the “language understanding” is distilled from the very large amount of data into a latent (or representation) space[13]. From this space, and from probability distributions that the model also produces, synthetic words can be generated and combined into meaningful sentences that fit the context. The exact inner workings of the trained neural network remain (initially?) hidden, which is important for further considerations.
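Generating a word from a probability distribution can be sketched as follows. The scores are hypothetical stand-ins for what a trained network would output, and the names `next_word_distribution` and `sample_next_word` are assumptions for this illustration.

```python
import math
import random

def next_word_distribution(scores):
    """Turn raw scores per candidate word into probabilities (softmax)."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def sample_next_word(distribution, rng):
    """Draw one word at random, in proportion to its probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical scores for the word following "the bank raised the interest ..."
scores = {"rate": 3.0, "payment": 1.5, "parrot": -2.0}
distribution = next_word_distribution(scores)
rng = random.Random(0)  # fixed seed so that repeated runs give the same draws
samples = [sample_next_word(distribution, rng) for _ in range(5)]
print(distribution)
print(samples)
```

Because the word is drawn at random rather than always taking the most probable one, the same prompt can lead to different continuations.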

Almost all language models that exist today are transformer-based and can be divided into those that strictly follow the concept and those that are encoder- or decoder-oriented. Examples of the latter are the first language models, published in June and October 2018: GPT (Generative Pre-trained Transformer, from OpenAI) (cf. Radford et al., 2018) and BERT (Bidirectional Encoder Representations from Transformers, from Google) (cf. Devlin, Chang, Lee & Toutanova, 2019). Large language models (LLMs) are usually created in a multi-stage process.

After (pre-)training, a base model (pretrained language model, PLM) exists, which then needs to be “fine-tuned”[14], e.g. in order to conduct conversations or perform certain functions better. This is possible by means of fine-tuning procedures and of reinforcement learning with human evaluation (reinforcement learning from human feedback, RLHF)[15], and it is carried out in a “supervised” manner:

(Image source: Raschka. Build a Large Language Model (From Scratch) Version 4, Manning Early Access Program 2024, to be published in full by Manning Publications in late 2024)

Strictly speaking, the term “large language model” is only used when the number of parameters exceeds 1 billion. The basic version of BERT, BERT Base, has 110 million trainable parameters; BERT Large has 340 million. These models serve as the basis for a series of “descendants” that are tailored to special domains or questions.

GPT-1 had 117 million parameters, GPT-2 already had 1.5 billion in November 2019, and GPT-3 had 175 billion in May 2020.
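The growth from one generation to the next follows from simple division. The sketch below uses only the parameter counts quoted in the text; the variable names are chosen for this illustration.

```python
# Parameter counts as quoted in the text.
parameters = {
    "GPT-1": 117_000_000,
    "GPT-2": 1_500_000_000,
    "GPT-3": 175_000_000_000,
}

names = list(parameters)
growth = {}
for previous, current in zip(names, names[1:]):
    # Size of each model relative to its direct predecessor.
    growth[current] = parameters[current] / parameters[previous]
    print(f"{current} is about {growth[current]:.0f} times the size of {previous}")
```

This gives a factor of about 13 from GPT-1 to GPT-2 and about 117 from GPT-2 to GPT-3.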

Of course, numerous other companies have also published LLMs; we will discuss some of them later. In total, thousands of LLMs exist today, many of them descendants of a few pioneers (cf. S. Gao, A. K. Gao, 2023).

In principle, GPT and BERT follow different lines of development. BERT has concentrated on adapting to special tasks with a relatively small number of parameters. BERT is encoder-oriented and open source and can be easily adapted due to its size.

This is why there are so many publications on a wide range of applications and so many BERT descendants that are available and specialized. Hugging Face lists thousands of models.[16]

The path of the GPT-n series (decoder-oriented and licensed) is different: its size has grown enormously, each generation being roughly ten and then one hundred times larger than its predecessor, up to GPT-3 with 175 billion parameters. GPT-3 is also not an open source solution but can be used via a paid API from OpenAI.

Image source: Zhao et al. “A survey of large language models”, arXiv:2303.18223, 2023

On the way to ChatGPT, the latest development stage for the time being, GPT-3 was improved in two areas in particular, through training on “code generation” and on “understanding instructions”. The training on code led, beyond the code topic itself, to skills in reasoning and in solving complex tasks. Reinforcement learning with human judgment (RLHF) had already been used to improve GPT-2 for NLP tasks; applied to GPT-3 (via InstructGPT), it advanced the dialog capability.

With ChatGPT, after some 80 years of progress and setbacks, the goal has now been reached: software that gives the computer a surprisingly good language capability.


Bahdanau, D., Cho, KH., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate.

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, Burstein, J., Doran, C., & Solorio, T. Eds. Association for Computational Linguistics, 1, 4171-4186.

Gao, S., Gao, A. K. (2023). On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.

Tunstall, L., von Werra, L., Wolf, T. (2023). Natural Language Processing mit Transformern. German Edition, O’Reilly.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, December 4-9, 2017, Long Beach, CA, USA, 5998-6008.

Yildiz, M., Hattatoglu, F., Erdogan, M., Erboga, M. (2023). Generative AI and Large Language Models: An Overview of Current Trends and Terminology in the Field. Independently published.

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., … & Wen, J. R. (2023a). A survey of large language models. arXiv preprint.

[1] OpenAI (@OpenAI), “OpenAI on Twitter: ‘Try talking with ChatGPT, our new AI system which is optimized for dialogue. Your feedback will help us improve it.’” Twitter. 1598014522098208769?cxt=HHwWgsCi-bfvpK0sAAAA

[2] Changeable weights of a neural network during the “learning process”

[3] Large-scale military nuclear research project that led to the construction of the first atomic bomb in 1945 at great expense of resources

[4] See e.g.!5925502/

[5] For example spaCy (

[6] Graphics processing unit (GPU)

[7] Supervised machine learning with labels (annotations) that are generated without human intervention

[8] This background to the semantic approach also stems from the “Distributional Hypothesis”, which states: “A word is characterized by the company it keeps” from the 1950s, attributed to J.R. Firth

[9] Applied to the given data (texts), this objective enables self-supervised learning: words in a sentence are covered and the model learns to estimate them, so the text itself supplies the labels

[10] Attention principle still for Recurrent Neural Networks (RNNs) with Long Short Term Memory (LSTM)

[11] This title inspired no less than 50 follow-up articles ( to include “all you need” in their titles

[12] In order to be able to calculate with words, they must be translated into number space (vector space). This topic will be explained in more detail later in the series under “Embedding”

[13] Cf.

[14] Adaptation of the model to special requirements through retraining

[15] Reinforcement learning is a subtype of machine learning through feedback in the form of “reward and punishment”, RLHF through human feedback.


Wilhelm Niehoff