Three ingredients of innovative data governance

Data governance should not be the enemy of innovation. The two can exist in harmony, but success requires a few key features to be in place.

When you hear the term data governance, do you first think of draconian policies that put safety and regulation above business value? Unfortunately, this is the approach many organizations have taken. They focus so heavily on restricting data to meet security and regulatory requirements that they eliminate the possibility of generating business value from it. The future of data governance lies in continuing to protect data, but in a way that enables organizational innovation.

Even though a strong data governance policy and a strong culture of innovation may seem contradictory, there are constructs that make the two compatible. Three of the most important practices and processes for enabling innovative data governance are synthetic data, DataOps, and a walled garden for your citizen data scientists.

Synthetic data

The first ingredient of innovative data governance is the ability to provide a dataset that is statistically similar to the real dataset without exposing private or confidential data. This can be accomplished using synthetic data.

Synthetic data is created by using real data to initiate a process that then generates data that looks real but isn't. Variational autoencoders (VAEs), generative adversarial networks (GANs), and real-world simulations can all produce data that provides a foundation for experimentation without leaking real records and exposing the organization to unacceptable risk.

VAEs are neural networks composed of an encoder and a decoder. During encoding, the data is compressed into a smaller feature set; features are transformed and combined, removing detail from the original data. During decoding, the compression is reversed, producing a dataset similar to, but different from, the original. The goal of training is an encoder and decoder whose output cannot be directly attributed to the initial data source.

Consider an analogy: take a book and run it through a language translator (the encoder), then translate the result back into the original language (the decoder). The resulting text would be similar but not identical.
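
To make this concrete, below is a minimal VAE sketch in PyTorch for tabular data. Everything here is an illustrative assumption rather than a prescribed implementation: the names (TabularVAE, NUM_FEATURES, LATENT_DIM), the layer sizes, and the choice of tabular inputs.

```python
import torch
import torch.nn as nn

NUM_FEATURES = 16  # width of the (hypothetical) tabular dataset
LATENT_DIM = 4     # size of the compressed representation

class TabularVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress the feature set into a small latent space.
        self.encoder = nn.Sequential(nn.Linear(NUM_FEATURES, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, LATENT_DIM)      # latent means
        self.to_logvar = nn.Linear(32, LATENT_DIM)  # latent log-variances
        # Decoder: reverse the compression back to full feature width.
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_FEATURES)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence toward a standard normal,
    # which keeps the latent space smooth enough to sample from.
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training, synthetic rows come from decoding random latent samples:
#   synthetic = model.decoder(torch.randn(1000, LATENT_DIM))
```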

GANs are a more complex construct consisting of a pair of neural networks: a generator and a discriminator. The generator uses seed data to create candidate datasets, and the discriminator judges whether a given dataset is real or synthetic. Through an iterative process, the generator improves its output to the point where the discriminator can no longer differentiate the real dataset from the synthetic one. At that point, the generator can create datasets that look indistinguishable from real data but can be used for data experimentation.
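
Below is a minimal sketch of one GAN training step in PyTorch, again under illustrative assumptions: tabular data of width NUM_FEATURES, a random-noise seed of size NOISE_DIM, and deliberately tiny networks.

```python
import torch
import torch.nn as nn

NUM_FEATURES, NOISE_DIM = 16, 8  # illustrative sizes

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_FEATURES)
)
discriminator = nn.Sequential(
    nn.Linear(NUM_FEATURES, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
)
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    """One adversarial update: discriminator first, then generator."""
    n = real_batch.size(0)
    real_labels, fake_labels = torch.ones(n, 1), torch.zeros(n, 1)

    # 1. Teach the discriminator to separate real rows from generated ones.
    fake_batch = generator(torch.randn(n, NOISE_DIM)).detach()
    d_loss = bce(discriminator(real_batch), real_labels) + \
             bce(discriminator(fake_batch), fake_labels)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2. Teach the generator to produce rows the discriminator labels "real".
    fake_batch = generator(torch.randn(n, NOISE_DIM))
    g_loss = bce(discriminator(fake_batch), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# Usage: call train_step(batch) repeatedly over batches of real data, then
# sample synthetic rows with generator(torch.randn(k, NOISE_DIM)).
```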

In addition to these two methods, some organizations use game engines and physics-based engines to simulate datasets grounded in scientific principles (e.g., physics, chemistry, biology) and in how real-world objects interact. As these virtual simulations run, the resulting data, which is representative of real data, can be collected for analysis and experimentation.
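
As a toy illustration of simulation-driven synthetic data, the sketch below generates noisy projectile-motion readings with NumPy. It stands in for a full game or physics engine, and all names and parameters are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
G = 9.81  # gravitational acceleration, m/s^2

def simulate_flight(speed, angle_deg, dt=0.1):
    """Return (time, x, y) samples for one projectile until it lands."""
    angle = np.radians(angle_deg)
    vx, vy = speed * np.cos(angle), speed * np.sin(angle)
    t = np.arange(0.0, 2.0 * vy / G, dt)  # samples over the flight time
    x = vx * t
    y = vy * t - 0.5 * G * t**2
    # Add sensor noise so the output resembles real-world measurements.
    return t, x + rng.normal(0.0, 0.05, t.shape), y + rng.normal(0.0, 0.05, t.shape)

# Run many randomized simulations and collect the results for analysis.
rows = []
for _ in range(100):
    t, x, y = simulate_flight(speed=rng.uniform(10, 50),
                              angle_deg=rng.uniform(20, 70))
    rows.extend(zip(t, x, y))
print(f"collected {len(rows)} synthetic samples")
```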