Data Lineage: A Path to the Data-Driven Enterprise?
In recent years, the call for data-driven decisions and processes has grown rapidly in companies across every industry. Data-driven[a] companies such as Apple, Alphabet, and Microsoft are now among the most valuable companies in the world. To make data-driven decisions, however, a company must first solve several challenges concerning its organization and its data management. These include establishing a data-driven culture and introducing and implementing a data strategy. Another major challenge for many companies is gaining an overview of their own data inventories. In this context, companies face the following questions: What role does our data play in value creation, and how is it used in processes? What are the sources of our data, and which products, services, and applications is it used for? An overview of one’s own data is also playing an increasingly important role in the financial industry. Banks must be able to disclose their data to the regulator at any time, and regular reporting is becoming the norm. As transparency grows in importance, companies need a way to keep track of their data.
In this context, the principle of “data lineage” (DL) has become established. Building on metadata management, this concept makes it possible to structure and visualize a company’s data inventories and sources. This article explains the most important basics of data lineage and shows possible areas of application in the financial industry.
Metadata is often referred to as “data about data” and describes different types of resources in the enterprise in a structured way (e.g., data, documents, people, places, buildings, concepts). Using metadata gives every resource a uniform structure at the data level, so that every user can interpret it in the same way. An example: to uniquely identify all employees in a company, standardized metadata is used to describe them. For an employee, this includes a unique ID, the person’s name, birthday, home address, and marital status. The metadata enables unique identification even when names match (employees X and Y are both named Schmidt, but each has a unique ID).
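The employee example can be sketched in a few lines of Python. This is only an illustration: the class name, field names, and sample values are assumptions made for this sketch and do not come from a real system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EmployeeMetadata:
    """Standardized description of an employee (field names are illustrative)."""
    employee_id: str      # unique key that identifies exactly one person
    name: str
    birthday: date
    home_address: str
    marital_status: str


# Invented sample records: two employees who share the same name.
employees = [
    EmployeeMetadata("E-001", "Schmidt", date(1985, 3, 14), "Hauptstr. 1, Berlin", "married"),
    EmployeeMetadata("E-002", "Schmidt", date(1990, 7, 2), "Ringweg 8, Hamburg", "single"),
]

# Indexing by ID keeps both records apart; indexing by name would not.
by_id = {e.employee_id: e for e in employees}
by_name = {}
for e in employees:
    by_name.setdefault(e.name, []).append(e.employee_id)
```

Looking up `by_name["Schmidt"]` returns two IDs, while each entry in `by_id` points to exactly one person, which is the point of the unique ID.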
There are essentially five types of metadata (see Table 1). Descriptive metadata (1) is used to uniquely identify resources throughout the company (see the employee example). In addition to the title of the resource, it provides information such as the date of creation or the creator. Administrative metadata (2) enables the management of resources and includes information on their origin and archiving. Content ratings metadata (3) indicates the possible users or user groups of a resource within the organization. Relationship metadata (4), in contrast, describes relationships between different resources; for example, it can record which products are purchased by which customers. Meta-metadata (5) is used to manage existing metadata. It defines storage formats and the syntax used, i.e., the structure of the data, so that the software used in the company can interpret, display, and process the data correctly.
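As a rough sketch, the five types can be modeled as an enumeration, with one resource described under each type. The resource (a sales report), all keys, and all values below are invented for illustration; only the five type names follow the text.

```python
from enum import Enum


class MetadataType(Enum):
    DESCRIPTIVE = 1      # identifies the resource
    ADMINISTRATIVE = 2   # origin and archiving
    CONTENT_RATING = 3   # intended users or user groups
    RELATIONSHIP = 4     # links to other resources
    META_METADATA = 5    # format and syntax of the metadata itself


# Illustrative values only; none of them come from a real system.
report_metadata = {
    MetadataType.DESCRIPTIVE: {"title": "Quarterly sales report", "created": "2022-06-30"},
    MetadataType.ADMINISTRATIVE: {"origin": "dwh.sales_facts", "archive_after_years": 10},
    MetadataType.CONTENT_RATING: {"audience": ["Sales", "Management"]},
    MetadataType.RELATIONSHIP: {"derived_from": ["Sales orders"]},
    MetadataType.META_METADATA: {"format": "JSON", "encoding": "UTF-8"},
}


def missing_types(metadata):
    """Return the metadata types that a resource description does not cover yet."""
    return [t for t in MetadataType if t not in metadata]
```

A helper such as `missing_types` (a hypothetical name) could be used to check whether a resource has been described under all five types.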
Metadata management is responsible for managing the metadata of an enterprise. It is an important corporate function and has existed since the 1990s. While metadata management originated in IT departments (where large volumes of diverse data existed), the concept evolved over the years into an important component of data governance[b] and of the entire enterprise. It forms the basis for the analysis and use of corporate data. In short, in addition to the actual data values (employee X, Y), companies must maintain the metadata and context data (attributes: ID, name, birthday, etc.); otherwise, important information for data processing is missing.
The concept of data lineage builds on the approaches of metadata management and aims to analyze and visualize data flows in the enterprise. Data lineage considers the entire data life cycle (see Figure 1), from the creation to the deletion of data. Its particular focus, however, is on the origin of the data and on its traceability up to the point of use.
Characteristics and Implementation of Data Lineage
A distinction is made between two basic types of data lineage (see Figure 2).
Business lineage (1) provides high-level tracking of data streams, data sources, and usage locations in relation to business processes and lets users visualize them at the content level. Business problems can thus be better understood and enriched with additional contextual knowledge and data sources. Business lineage is particularly suitable for tracing reports (see Figure 3). Real business processes (marketing, sales) serve as the data basis; the data is processed in different ways and then presented in reports. Presenting these data flows increases transparency and confidence in the reports created.
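Tracing a report back to its origins can be illustrated with a small sketch. The flow names below are invented and expressed in business terms; the function assumes that the flows contain no cycles.

```python
# Hypothetical business-level flows as (source, target) pairs.
FLOWS = [
    ("Marketing campaigns", "Customer contacts"),
    ("Sales orders", "Revenue per customer"),
    ("Customer contacts", "Quarterly sales report"),
    ("Revenue per customer", "Quarterly sales report"),
]


def origins(item, flows=FLOWS):
    """Return the original sources (items without any upstream flow) behind `item`."""
    parents = [src for src, dst in flows if dst == item]
    if not parents:
        # Nothing feeds this item, so it is itself an original source.
        return {item}
    result = set()
    for parent in parents:
        result |= origins(parent, flows)
    return result
```

Here `origins("Quarterly sales report")` yields the two business processes at the start of the chain, which is the kind of answer a business user would expect from a lineage view.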
Technical lineage (2) gives developers and IT administrators full access to the data streams and sources, which they can then process and analyze at the level of tables, rows and queues. Dependencies between data can be identified at an early stage and taken into account when configuring the entire IT and data landscape.
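One way to picture how such dependencies are identified is an impact analysis over table-level lineage. The following is a minimal sketch; the table names and the `DOWNSTREAM` mapping are hypothetical.

```python
from collections import deque

# Hypothetical table-level lineage: each key feeds the tables listed as its values.
DOWNSTREAM = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["dwh.sales_facts"],
    "staging.orders": ["dwh.sales_facts"],
    "dwh.sales_facts": ["reports.quarterly_sales", "reports.marketing_kpis"],
}


def impacted_by(table, downstream=DOWNSTREAM):
    """Return every table that depends directly or indirectly on `table`."""
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Before changing `erp.orders`, for instance, an administrator could call `impacted_by("erp.orders")` to list the staging table, the fact table, and the reports that would be affected.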
To implement the concept of data lineage, three basic steps must be performed. In the first step, all relevant metadata is collected (1), and the dependencies between the individual metadata are identified. In the second step, the new metadata is consolidated and compared with the existing metadata (2). In the last step, the metadata is represented (3) so that users can interpret it; this includes visualizing data dependencies. The concept of a knowledge graph is used for this purpose. With the help of a knowledge graph, the metadata can be represented as a network that shows the relations between the metadata and thus enables users to recognize the connections in the data structures. Two standards are frequently used in the context of data lineage: the OpenLineage standard and the PROV-O standard. An example of the OpenLineage representation can be found at the end of this post.
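The three steps can be sketched as a tiny pipeline. This is a simplified illustration, not a description of any particular tool: metadata is assumed to arrive as (subject, relation, object) triples, and the source names, relation names, and sample triples are invented.

```python
def collect(sources):
    """Step 1: gather (subject, relation, object) triples from every source."""
    triples = []
    for records in sources:
        triples.extend(records)
    return triples


def consolidate(new_triples, existing):
    """Step 2: merge new metadata into the existing set, dropping duplicates."""
    return set(existing) | set(new_triples)


def represent(triples):
    """Step 3: turn the triples into an adjacency map, a very simple knowledge graph."""
    graph = {}
    for subject, relation, obj in sorted(triples):
        graph.setdefault(subject, []).append((relation, obj))
    return graph


# Invented sample metadata from two hypothetical sources plus an existing stock.
etl_log = [
    ("staging.orders", "derived_from", "erp.orders"),
    ("dwh.sales_facts", "derived_from", "staging.orders"),
]
catalog = [
    ("dwh.sales_facts", "derived_from", "staging.orders"),  # duplicate of an ETL entry
    ("reports.quarterly_sales", "derived_from", "dwh.sales_facts"),
]
existing = {("erp.orders", "owned_by", "Sales")}

consolidated = consolidate(collect([etl_log, catalog]), existing)
graph = represent(consolidated)
```

The resulting `graph` maps each node to its outgoing relations. A real implementation would typically store this in a graph database and render it visually; the dictionary stands in for that here.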
Advantages and Challenges
The end-to-end documentation, analysis, and visualization of data streams enables companies to better exploit the potential of their own data. A thorough understanding of the data increases trust in it. Data sources and data streams are transparent for users and administrators and can be traced at any time. This allows companies to adjust data streams and the IT landscape in a time- and cost-efficient manner. For example, according to Collibra, technical lineage allows IT to analyze data streams up to 98% more efficiently. Causes of errors can also be found and corrected quickly. Furthermore, business users receive additional information that can support them in their daily work (see the report example) or in decision-making. This increases data-driven value creation within the company. Another advantage of data lineage is efficient reporting to the regulator. This is relevant, for example, for BCBS 239 (a standard for risk reporting by credit institutions) and the GDPR (General Data Protection Regulation). Companies can generate these reports more quickly because they have a permanent overview of the critical data infrastructure.
As explained at the beginning of this article, structured and well-defined metadata management is the foundation for successful data lineage projects and can therefore be considered a prerequisite for implementing data lineage. In addition to access rights, the dynamics of metadata must be considered and captured by appropriate systems: data sources and streams can change over time, and metadata management must continuously keep up with these changes. In addition, the conceptual relationships between the metadata should always be documented using ontologies, which make relationships and dependencies particularly easy to recognize.
Use Case from the Financial Industry
The advantages mentioned are of particular importance in the financial industry. In addition to laws such as the GDPR, banks must comply with a steady stream of new regulatory requirements. For example, the European Central Bank’s Targeted Review of Internal Models (TRIM) requires banks to disclose the internal models they use to calculate asset values and the associated risks. In this context, data lineage enables the disclosure of data streams and internal models to third parties. In addition to data processing, data protection also plays an important role in a bank. With the help of data lineage, banks can gain insight into the processes and systems in which personal data is processed and which therefore require particular protection.
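How lineage can reveal the processes that handle personal data may be sketched as follows. The job and dataset names are hypothetical, and the sketch makes a deliberately conservative assumption: every output of a job that reads personal data is itself treated as containing personal data.

```python
# Hypothetical processing jobs with their input and output datasets.
JOBS = [
    {"name": "load_customers", "inputs": ["crm.customers"], "outputs": ["staging.customers"]},
    {"name": "load_rates", "inputs": ["market.fx_rates"], "outputs": ["staging.fx_rates"]},
    {"name": "build_risk_model_input",
     "inputs": ["staging.customers", "staging.fx_rates"],
     "outputs": ["risk.model_input"]},
    {"name": "fx_report", "inputs": ["staging.fx_rates"], "outputs": ["reports.fx"]},
]


def jobs_handling_personal_data(jobs, personal_datasets):
    """Return the names of all jobs that read personal data, directly or indirectly."""
    tainted = set(personal_datasets)   # datasets known or assumed to hold personal data
    flagged = set()                    # jobs that touch at least one tainted dataset
    changed = True
    while changed:                     # repeat until no new job is flagged
        changed = False
        for job in jobs:
            if job["name"] not in flagged and tainted.intersection(job["inputs"]):
                flagged.add(job["name"])
                tainted.update(job["outputs"])  # conservative: outputs inherit the tag
                changed = True
    return flagged
```

With `crm.customers` marked as personal, the sketch flags `load_customers` and `build_risk_model_input`, but not the two jobs that only work with exchange rates.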
Data lineage is an important component for companies on their way to becoming data-driven enterprises. Thanks to its different approaches (technical and business lineage), the representation of corporate data and its flows can be adapted to different user groups. The most important basis in this context is the company’s metadata and its management. In addition to metadata, however, other areas such as data-driven culture and data governance must also receive continued attention and support. Gartner therefore speaks of Active Metadata, the principle of continuously analyzing metadata in relation to users, data management, systems, and infrastructures as well as data governance.
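As an illustration of the OpenLineage representation mentioned earlier, the following sketch assembles a dictionary shaped like an OpenLineage run event and prints it as JSON. The field names follow the OpenLineage run event format as far as they are used here; the namespace, job name, dataset names, and producer URI are assumptions made for this example, and further fields defined by the specification (for example the schema URL and facets) are omitted. Sending the event to a lineage backend such as Marquez is not shown.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="example-bank"):
    """Assemble a dictionary shaped like an OpenLineage run event (simplified)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        "producer": "https://example.com/lineage-sketch",  # hypothetical producer URI
    }


event = build_run_event(
    "COMPLETE",
    "build_sales_facts",          # hypothetical job name
    inputs=["staging.orders"],    # hypothetical input dataset
    outputs=["dwh.sales_facts"],  # hypothetical output dataset
)
print(json.dumps(event, indent=2))
```

Each event ties one run of a job to the datasets it reads and writes; a lineage backend can assemble many such events into the kind of graph described above.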
[a] Data-driven here means that data forms the basis for decisions in the company.
[b] Assignment of decision-making rights and associated duties in the management of data in companies.
Sources
Bleiholderoc, J., “Auf dem Weg zum datengetriebenen Unternehmen: Was es bedeutet datengetrieben zu sein und welche Themenfelder wichtig sind (Teil 1/3)”, The Cattle Crew Blog, 2020. https://thecattlecrew.net/2020/03/25/auf-dem-weg-zum-datengetriebenen-unternehmen-teil-1/
Gourévitch, A., L. Faeste, E. Baltassis, and J. Marx, “Data-driven Transformation”, bcg.perspectives, 2017.
Waller, D., “10 Steps to Creating a Data-Driven Culture”, Harvard Business Review, 2020. https://hbr.org/2020/02/10-steps-to-creating-a-data-driven-culture
DalleMule, L., and T.H. Davenport, “What’s Your Data Strategy?”, Harvard Business Review, 2017. https://hbr.org/2017/05/whats-your-data-strategy
Collibra, “The top benefits of data lineage”, Collibra, 2020. https://www.collibra.com/us/en/blog/the-top-benefits-of-data-lineage
“Regulatorisches Meldewesen für Banken”, https://www.ppi.de/banken/regulatorische-anforderungen/regulatorisches-meldewesen/
“Data lineage: Data origination and where it moves over time”, Deloitte Netherlands. https://www2.deloitte.com/nl/nl/pages/financial-services/articles/data-lineage.html
Sebastian-Coleman, L., Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework, Newnes, 2012.
Rühle, S., “Kleines Handbuch Metadaten”, pp. 10.
Hüner, K.M., B. Otto, H. Österle, and B. Brauer, “Fachliches Metadatenmanagement mit einem semantischen Wiki”, HMD Praxis der Wirtschaftsinformatik 48(1), 2011, pp. 98–108.
Otto, B., “Data Governance”, Business & Information Systems Engineering 3(4), 2011, pp. 241–244.
Prukalpa, “The Gartner Magic Quadrant for Metadata Management was just scrapped.”, Medium, 2021. https://towardsdatascience.com/the-gartner-magic-quadrant-for-metadata-management-was-just-scrapped-d84b2543f989
“The 5 stages of Data LifeCycle Management – Data Integrity”, Dataworks, 2019. https://www.dataworks.ie/5-stages-in-the-data-management-lifecycle-process/
Collibra, “What is data lineage and why is it important?”, Collibra, 2022. https://www.collibra.com/us/en/blog/what-is-data-lineage
DAMA International, DAMA-DMBOK: Data Management Body of Knowledge: 2nd Edition, Technics Publications, Basking Ridge, New Jersey, 2017.
Loyens, J., “How Should We Be Thinking about Data Lineage?”, Medium, 2022. https://towardsdatascience.com/how-should-we-be-thinking-about-data-lineage-541ca5ab83d0
IBM Cloud Education, “What is a Knowledge Graph?”, 2021. https://www.ibm.com/cloud/learn/knowledge-graph
“Principles for effective risk data aggregation and risk reporting”, 2013.
“Challenges to Managing Metadata”, Data and Technology Today, 2013. https://datatechnologytoday.wordpress.com/2013/01/12/challenges-to-managing-metadata/
Missier, P., P. Alper, O. Corcho, I. Dunlop, and C. Goble, “Requirements and Services for Metadata Management”, Internet Computing, IEEE 11, 2007, pp. 17–25.
Pathak, J., D. Caragea, and V. Honavar, “Ontology-Extended Component-Based Workflows”, 2004.
Hermeling, M., “‘Back to the Roots’ – Why Data Lineage is Key for Financial Services Firms”, International Banker, 2019. https://internationalbanker.com/technology/back-to-the-roots-why-data-lineage-is-key-for-financial-services-firms/
“Targeted Review of Internal Models – TRIM”, Deloitte Deutschland. https://www2.deloitte.com/de/de/pages/financial-services/articles/targeted-review-of-internal-models-trim.html
“Active Metadata Management: Why Is It Essential in 2022?”, Atlan. https://atlan.com/active-metadata-management/
“Quickstart”, Marquez. https://marquezproject.github.io/marquez/quickstart.html
- Data Lineage: A Path to the Data-Driven Enterprise? - 26.08.2022