Synthetic Data – The Future of Data-Driven Financial Services?
The future of banks depends on successful digital transformation. One of the biggest challenges in this context is the handling of bank-specific and personal data and its processing by artificial intelligence (AI). AI is a disruptive technology in the financial industry and can be used along the entire value creation process. Example applications of AI in the financial industry include automated know-your-customer processes and anti-money laundering activities. The use of AI also makes it possible to offer data-driven services to customers. Such services rest on a high-quality and up-to-date database, which means that both the quality and the quantity of the data matter. The example of anti-money laundering illustrates this: data-driven pattern recognition of money laundering activities is only possible with a database that is high in quality and sufficient in quantity. Above all, the number of correctly detected fraud cases (true positives) is often very low, which leaves few positive examples to learn from.
But not all companies have a large enough database to train an algorithm, and the sharing and even the basic use of some data is strictly limited, sometimes even within the company. In particular, the General Data Protection Regulation (GDPR) in force in Europe makes it difficult to handle personal data. Thus, a bank’s ability to assemble a solid database for training and using AI is limited. As a result, the efficiency of this technology, the trust in it and its decision-making ability decline. For anti-money laundering activities, this means that illegal activities are difficult to detect: depending on the sensitivity of the algorithm, either too many fraud cases are missed (false negatives) or too many legitimate transactions are classified as fraud (false positives). If too few fraud cases are detected, the financial resources spent on developing the algorithm seem wasted; if too many transactions are flagged, the bank has to follow up on too many suspicious cases, which also ties up resources.
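The trade-off between false negatives and false positives can be made concrete with a small sketch. The scores, labels and thresholds below are invented purely for illustration and do not come from any real detection system; the assumption is that the algorithm outputs a suspicion score between 0 and 1 and flags everything at or above a threshold.

```python
# Hypothetical (score, is_fraud) pairs; all values are made up for illustration.
scored = [
    (0.95, True), (0.80, True), (0.40, True),
    (0.70, False), (0.30, False), (0.20, False), (0.10, False), (0.05, False),
]

def confusion_counts(scored, threshold):
    """Count outcomes when every score >= threshold is flagged as suspicious."""
    tp = sum(1 for score, fraud in scored if score >= threshold and fraud)
    fp = sum(1 for score, fraud in scored if score >= threshold and not fraud)
    fn = sum(1 for score, fraud in scored if score < threshold and fraud)
    tn = sum(1 for score, fraud in scored if score < threshold and not fraud)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# A strict threshold misses fraud; a lenient one flags legitimate transactions.
strict = confusion_counts(scored, threshold=0.9)
lenient = confusion_counts(scored, threshold=0.25)
print("strict: ", strict)
print("lenient:", lenient)
```

With these made-up numbers, the strict threshold misses two of the three fraud cases, while the lenient one catches all three at the cost of two false alarms.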
To counteract these problems, the concept of synthetic data has become established. The Gartner Hype Cycle 2021 classifies synthetic data as very relevant (“emerging technologies”) for AI research and expects that within the next five years the technology will be mature enough to be used by a broad range of companies. Using this data, banks can continue to train high-quality and efficient AI models while respecting data protection regulations. In the case of anti-money laundering activities, synthetic data enables anonymous data sharing, so banks can feed their algorithms with more data, which helps to identify fraud scenarios. This blog post introduces readers to the topic of synthetic data in the context of the financial industry.
Definition and Reference Process for the Creation of Synthetic Data
Synthetic data is artificially created and represents an abstraction of real data. Properties of existing data are analyzed and used to build an artificial data set. This abstraction allows the original data to be used anonymously. The artificial data set has the same statistical properties and data structures as the original. In the anti-money laundering example, synthetic data can be generated based on real customer data and bank transactions to detect patterns of money laundering activities. This process uses real data such as account openings, payments and purchases. The newly generated artificial data does not contain any personal data, yet it represents a statistical abstraction of the real customer and transaction data. Thus, the necessary patterns continue to be represented by the synthetic dataset while data privacy is maintained.
Synthetic data allows the extension of existing data sets (partially synthetic data) or the creation of new artificial data sets (fully synthetic data). When extending an existing data set, only individual columns of a database table are exchanged for artificially created data. An example: the original data set contains three columns with information about “location”, “time” and “transaction partner”. The partially synthetic dataset continues to use the “time” column from the original dataset, but abstracts the properties of the “transaction partner” and “location” columns. Fully synthetic datasets contain no content from the original dataset. Thus, the “time” column in the fully synthetic dataset would also be abstracted, but would still show the same patterns as the original data. Figure 3 shows an example synthetic dataset generated for anti-money laundering.
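The difference between partially and fully synthetic data can be sketched in a few lines of Python. The records, the helper names `sample_column` and `synthesize`, and the generation method are assumptions made for this illustration: each replaced column is simply resampled from its observed value frequencies. That is far too naive for real use, since it reuses original values and ignores relationships between columns, but it shows which columns are kept and which are replaced.

```python
import random
from collections import Counter

# Hypothetical original records; every value is made up for illustration.
original = [
    {"location": "Berlin", "time": "09:15", "transaction_partner": "Shop A"},
    {"location": "Berlin", "time": "11:40", "transaction_partner": "Shop B"},
    {"location": "Munich", "time": "13:05", "transaction_partner": "Shop A"},
    {"location": "Hamburg", "time": "18:30", "transaction_partner": "Shop C"},
]

def sample_column(rows, column, n, rng):
    """Draw n values that follow the observed frequencies of one column."""
    counts = Counter(row[column] for row in rows)
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=n)

def synthesize(rows, keep, rng):
    """Copy the columns named in `keep`; replace all others with generated values."""
    columns = list(rows[0])
    generated = {
        c: sample_column(rows, c, len(rows), rng) for c in columns if c not in keep
    }
    return [
        {c: row[c] if c in keep else generated[c][i] for c in columns}
        for i, row in enumerate(rows)
    ]

rng = random.Random(42)
partially = synthesize(original, keep={"time"}, rng=rng)  # "time" stays original
fully = synthesize(original, keep=set(), rng=rng)         # every column replaced
```

A realistic generator would model the joint distribution of the columns rather than each column on its own, so that patterns across columns survive.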
In summary, the main advantage of synthetic data is the scalable generation of additional data, which is urgently needed as input for AI-based models. Its use additionally addresses the problem of data privacy, since no personal data is included in the (fully) synthetic dataset and data in partially synthetic datasets cannot be assigned to real persons.
Different approaches exist for the creation of synthetic data. To keep the discussion general, this blog post presents a reference process from J. P. Morgan AI Research.
First, the original data are analyzed and various parameters (e.g. distribution function, variance) are calculated from the data (1). These parameters make the properties of a data set comparable and evaluable. Subsequently, a data generator (2) creates synthetic data (4). Statistical methods, artificial neural networks or agent-based simulations are used for this purpose. The data generator can optionally be calibrated using original data (3). In the next step, the same parameters are calculated for the synthetic data set (5). Comparing the parameters of the synthetic and original data allows conclusions to be drawn about the quality of the artificial data. The generator can now be adjusted to optimize the quality of the synthetic data (7).
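The loop described above can be sketched in Python under strong simplifying assumptions. This is not the J. P. Morgan implementation: the generator is a plain Gaussian sampler, the parameters are only mean and standard deviation, the adjustment rule is a simple additive correction, and the "original" data is a randomly generated stand-in. All function names are hypothetical, and the optional calibration step (3) is skipped so that the effect of the adjustment becomes visible.

```python
import random
import statistics

def compute_parameters(data):
    """Steps (1) and (5): summarize a data set by mean and standard deviation."""
    return {"mean": statistics.fmean(data), "stdev": statistics.stdev(data)}

def generate(params, n, rng):
    """Steps (2) and (4): a deliberately simple Gaussian data generator."""
    return [rng.gauss(params["mean"], params["stdev"]) for _ in range(n)]

def quality_gap(target, measured):
    """Comparison: largest absolute difference between the two parameter sets."""
    return max(abs(target[k] - measured[k]) for k in target)

def run_reference_process(original, rounds=3, n=2000, seed=0):
    rng = random.Random(seed)
    target = compute_parameters(original)
    gen_params = {"mean": 0.0, "stdev": 1.0}  # uncalibrated start, step (3) skipped
    gaps = []
    for _ in range(rounds):
        synthetic = generate(gen_params, n, rng)
        measured = compute_parameters(synthetic)
        gaps.append(quality_gap(target, measured))
        # Step (7): shift the generator settings by the observed difference.
        gen_params = {k: gen_params[k] + target[k] - measured[k] for k in target}
        gen_params["stdev"] = max(gen_params["stdev"], 1e-9)
    return synthetic, gaps

# Stand-in for real transaction amounts; purely illustrative.
data_rng = random.Random(1)
original = [data_rng.gauss(100, 20) for _ in range(500)]
synthetic, gaps = run_reference_process(original)
print([round(g, 2) for g in gaps])
```

The printed gaps should shrink sharply after the first adjustment, which is the behavior the comparison and adjustment steps are meant to produce.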
Use Cases in the Financial Industry
Synthetic data is well suited to data processing and data analysis in banks because it improves the database available to AI algorithms. In addition, data anonymity allows the use of abstracted data across the banking ecosystem and enables intra- and inter-organizational collaboration. The following section examines exemplary application areas of this technology in the context of the financial industry.
Data Exchange and Collaboration
The GDPR prevents the exchange of personal data between banks and, in part, even within a bank. Different departments are thus unable to exploit the bank’s full data potential when evaluating data, and they need to go through lengthy approval processes to obtain access to data. Internal projects for which departments could need synthetic data include anti-money laundering, customer journey events and risk management. Synthetic data also supports banks in exchanging data with other banks and institutions. This includes exchanges with research partners, potential business partners, and official institutions such as the government.
Money laundering refers to the channeling of money from illegal activities into the regular financial system. Synthetic data can aid pattern recognition in money laundering activities by duplicating correctly identified fraud cases. With a larger amount of synthetic training data, the AI algorithm can better identify potentially suspicious accounts, transactions, payments, withdrawals or purchases. Because the data is anonymous, datasets from different banks can be transmitted to institutions and governments, which can then examine them as a merged dataset. Patterns of criminal activity can still be seen in the synthetic data without compromising data protection. This enables inter-organizational processes in combating money laundering while maintaining data protection. Especially in money laundering detection, very large data sets are of enormous importance for avoiding statistical false-positive reports.
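One simple way to picture the duplication of correctly identified fraud cases is oversampling with small random variation. The sketch below is an assumption for illustration, not a method described by any bank: the record layout, the `oversample_fraud` helper, the jitter size and the target share are all invented.

```python
import random

def oversample_fraud(transactions, target_share, rng, jitter=0.05):
    """Add varied copies of confirmed fraud cases until they reach target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1 (exclusive)")
    fraud = [t for t in transactions if t["is_fraud"]]
    if not fraud:
        raise ValueError("at least one confirmed fraud case is required")
    augmented = list(transactions)
    n_fraud = len(fraud)
    while n_fraud / len(augmented) < target_share:
        template = rng.choice(fraud)
        factor = 1 + rng.uniform(-jitter, jitter)  # vary the amount slightly
        augmented.append({
            "amount": round(template["amount"] * factor, 2),
            "is_fraud": True,
            "synthetic": True,  # mark generated records so they stay traceable
        })
        n_fraud += 1
    return augmented

# Hypothetical training data: 18 legitimate and 2 confirmed fraud transactions.
transactions = [{"amount": 50.0 + i, "is_fraud": False} for i in range(18)]
transactions += [
    {"amount": 9500.0, "is_fraud": True},
    {"amount": 9900.0, "is_fraud": True},
]

augmented = oversample_fraud(transactions, target_share=0.3, rng=random.Random(7))
print(len(augmented), sum(t["is_fraud"] for t in augmented))
```

Marking generated records with a `synthetic` flag keeps them distinguishable from real cases, which matters when the model is later evaluated on real data only.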
Market Simulations and Data Gaps
Synthetic data allows banks to simulate and test different strategies under extreme conditions. These include market collapses or system failures on the part of the bank. Gaps in the existing documentation of such incidents can also be filled with the help of synthetic data (partially synthetic data). An example from the automotive industry during the COVID-19 pandemic illustrates this: immediately after the first lockdown in 2020, many companies were overwhelmed by the challenges of the pandemic. Manufacturers had to stop work, and production was at a standstill in many places. Nevertheless, data sets for this period are needed for seamless planning and control of a plant. Synthetic data makes it possible to close these data gaps by generating artificial data.
Disruptive technologies such as AI are continuously increasing the relevance of and the need for synthetic data. Financial institutions can use this data to enrich internal and external data sets and to anonymize them. That said, it cannot be ruled out that, with sufficient technology and knowledge, a synthetic data set can be traced back to the original data set. Nevertheless, synthetic data can generate industry-wide insights, increase AI ROI, and create ecosystems.