This blog continues our series of insights into the Cyber, Identity and Privacy (CIP) sector of our RegTech Taxonomy. Previously, we looked at synthetic data as the key to opening new opportunities and at the importance of identity in a crisis, explored the intersection between Cyber Security and Data Privacy, examined the flurry of M&A activity in this space, and investigated the often-missed opportunity relating to Financial Market Infrastructure providers.
This piece, written by guest writer Barry Smith, an independent consultant, talks in more detail about the importance of ‘good’ synthetic data to test models, how tools produce masked data sets, residual risks in using synthetic data, and the different vendors in the market and the technologies used in their tools.
The importance of ‘good’ synthetic data
Synthetic data tools have huge potential to reduce data privacy risks. When developing models using advanced analytical techniques – especially techniques like machine learning – data synthesis can address the difficult trade-offs between model accuracy on the one hand and compliance with data protection requirements (such as the General Data Protection Regulation or GDPR) on the other.
The trade-off arises because advanced analytical models are calibrated or “trained”, tested and validated using large amounts of sample data. The more sophisticated the modelling technique, the larger the volume of data that is needed to train and re-train the model over time so that it performs accurately.
Where personal data is used for developing models, or indeed any other type of data that is sensitive for legal or commercial reasons, it’s essential that unauthorised access is prevented during the model development, testing and validation process. Traditionally, data masking or “pseudonymisation” has been used for this purpose.
However, to train, validate and maintain a model that will perform accurately under real operational conditions, the data that is used must be very similar to the original, preserving not just the statistical properties of the data within a single column of a data set, but the properties of the data set as a whole. This includes non-obvious underlying patterns in the data of the kind that might be discovered by a sophisticated neural network.
The risk is that the closer the masked data set is to the original, the better the chance of building a highly accurate model, but also the more likely it is that a data subject can be identified without authorisation. Even when the obvious identifiers are masked, re-identification may still be possible by combining several remaining attributes that together provide a high-probability link to the original data subject. This is a particularly high risk for outliers – for example, a person in a sparsely populated area, a person with a rare medical condition or a person of very advanced age.
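To make the re-identification risk concrete, the sketch below counts how many records share each combination of remaining attributes and flags any combination that occurs fewer than k times. This is a simple k-anonymity-style check written for illustration only; the field names, sample records and the `risky_records` helper are all hypothetical and do not come from any vendor's tool.

```python
from collections import Counter


def risky_records(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination is shared by fewer than k records."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


# Hypothetical masked data: names are removed, but other attributes remain.
masked = [
    {"postcode_area": "EC1", "age_band": "30-39", "condition": "asthma"},
    {"postcode_area": "EC1", "age_band": "30-39", "condition": "asthma"},
    {"postcode_area": "EC1", "age_band": "30-39", "condition": "diabetes"},
    {"postcode_area": "IV27", "age_band": "90+", "condition": "rare disorder"},
]

# Records that are unique on the combination of area and age band.
unique = risky_records(masked, ["postcode_area", "age_band"], k=2)
print(unique)
```

Here the single record from the sparsely populated area with a very advanced age band is flagged, even though no direct identifier is present.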
Ample opportunities balanced against unavoidable risks
The idea behind data synthesis tools is that instead of masking existing data records one-by-one, the tool processes a real input data set in its entirety and “learns” its statistical properties. It then produces a completely new data set that reproduces those properties without any one-to-one link with the original records. An arbitrary number of output records, independent of the size of the original data set, could be produced – although of course, there must be a sufficient number of input records in the input data set for the method to work reliably.
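The "learn, then generate" idea can be sketched with a deliberately simple stand-in model. The example below fits a two-column normal distribution (means, standard deviations and one correlation) to a small input set and then samples an arbitrary number of new rows from it. This is an assumption-laden toy: commercial tools use far richer generators, and the sample data, `fit` and `generate` are hypothetical names chosen for illustration.

```python
import math
import random
import statistics


def fit(rows):
    """Learn the means, standard deviations and correlation of two numeric columns."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def generate(model, n, rng):
    """Draw n new rows from a bivariate normal with the learned parameters."""
    rho = model["rho"]
    # Clamp to guard against tiny negative values from floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical input: (age, annual income) pairs.
real = [(25, 22000), (32, 31000), (41, 38000), (47, 52000),
        (53, 49000), (38, 35000), (29, 30000), (60, 61000)]

model = fit(real)
# The number of output rows is independent of the number of input rows.
synthetic = generate(model, 5000, random.Random(0))
```

No output row is derived from any single input row; only the fitted parameters link the two data sets. With so few input rows those parameters are themselves unreliable, which is why a sufficiently large input set matters.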
There is still a residual privacy risk: this is ultimately unavoidable. A data synthesis tool might by chance produce a record that is identical to, or more likely just “too similar” to, a record in the input data. Some data synthesis tool vendors provide checks to prevent this, based on proximity measures. Some even allow the privacy-fidelity trade-off to be fine-tuned by means of a parameter in the tool. This can be set by users in order to reflect the risk tolerance of the organisation, model accuracy requirements, the sensitivity of the data itself and the maturity of the additional data privacy safeguards that are in place.
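A proximity check of this kind might look like the sketch below, which drops any generated record that lies closer than a chosen threshold to its nearest original record. It assumes numeric columns already scaled to a common range and uses plain Euclidean distance; the function names and the `min_distance` parameter are hypothetical, and vendors' actual proximity measures are proprietary and may differ.

```python
import math


def distance_to_closest_record(candidate, originals):
    """Euclidean distance from a candidate record to its nearest original record."""
    return min(math.dist(candidate, original) for original in originals)


def filter_too_similar(synthetic, originals, min_distance):
    """Keep only synthetic records at least min_distance away from every original."""
    return [s for s in synthetic
            if distance_to_closest_record(s, originals) >= min_distance]


# Hypothetical records with two columns scaled to the range 0..1.
originals = [(0.20, 0.35), (0.55, 0.60), (0.90, 0.10)]
candidates = [(0.21, 0.35), (0.40, 0.45), (0.90, 0.10), (0.70, 0.80)]

# A larger min_distance favours privacy; a smaller one favours fidelity.
released = filter_too_similar(candidates, originals, min_distance=0.05)
print(released)
```

The near-duplicate and the exact duplicate are removed, while the two candidates that are not close to any original are released. Raising the threshold plays the role of the tunable privacy-fidelity parameter described above.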
What’s on the market?
Apart from the statistical properties of the data in a single column (the “univariate” distributions in the data), the properties that most tool vendors aim to reproduce include, for example:
- correlations for each pair of columns (the “bivariate” distributions); and
- time-dependencies for time-series data.
Beyond this, a majority of tool vendors make use of advanced modelling techniques such as deep neural networks to learn the structure and properties of the input data – with the potential to reflect more complex and hard-to-discover underlying patterns.
All the data synthesis tool vendors use advanced proprietary algorithms to build data generators. In most cases, though not all, these are based on neural networks such as GANs (generative adversarial networks). One vendor uses a proprietary method developed from a technique known as K-Nearest Neighbours.
There is an increasing number of vendors in the data synthesis tools arena, such as Diveplane, Mostly AI, Leapyear, Hazy AI and Statice. They all offer quality assurance functionality, in various forms, such as:
- reports on univariate and bivariate statistical properties, comparing them between input and output data sets;
- reports on outliers with low numbers of instances;
- reports on output data set fidelity using proprietary “similarity” metrics;
- reports on model accuracy, comparing various accuracy metrics for standardised models that are built automatically with the input data versus the same models built using the generated output data.
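As a rough illustration of the first kind of report in the list above, the sketch below compares univariate statistics (mean and standard deviation) and bivariate statistics (Pearson correlation) between an input and an output data set. The column names, sample values and `fidelity_report` helper are hypothetical; vendors' own reports and similarity metrics are proprietary and considerably more detailed.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fidelity_report(original, synthetic):
    """Return (original, synthetic) pairs of statistics for each column and column pair."""
    columns = list(original)
    report = {}
    for c in columns:
        report[f"mean:{c}"] = (statistics.fmean(original[c]), statistics.fmean(synthetic[c]))
        report[f"stdev:{c}"] = (statistics.pstdev(original[c]), statistics.pstdev(synthetic[c]))
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            report[f"corr:{a},{b}"] = (pearson(original[a], original[b]),
                                       pearson(synthetic[a], synthetic[b]))
    return report


# Hypothetical input and generated data sets, stored column by column.
original = {"age": [25, 32, 41, 47, 53], "income": [22, 31, 38, 52, 49]}
synthetic = {"age": [27, 30, 43, 45, 55], "income": [24, 29, 40, 50, 51]}

report = fidelity_report(original, synthetic)
for name, (orig_value, synth_value) in report.items():
    print(f"{name}: original={orig_value:.3f} synthetic={synth_value:.3f}")
```

Large gaps between the two values in any row would suggest that the generated data does not reproduce that property of the input data well.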
Although the majority of vendors make use of cloud-deployable technologies such as Docker containers, for some use cases it can also be critically important to be able to deploy the tool itself and generate synthetic data on-premises, and only then transfer the generated data to the cloud. Some vendors allow the user to build a compact data generator on-premises using the sensitive input data, transfer the generator to the cloud and then generate large volumes of synthetic data there.
We are currently working with regulated institutions, advising them on the best ways to evaluate and utilise synthetic data across a variety of use cases. If you would like to speak to us, please drop us an email at firstname.lastname@example.org.
Barry has over 30 years’ experience in the financial services sector as an independent consultant, specialising in risk management, analytics, data science and regulatory change. As a subject matter expert with in-depth knowledge of credit risk, market risk, ERM and regulatory compliance, Barry manages both small and large-scale projects. He provides advisory services to clients up to board-level.