Organizations today are increasingly data-driven. With advances in technology and falling storage costs, the volume of data is growing exponentially every day. While data storage is no longer a concern, managing acquired data has become a problem.
Data governance is essential for every organization, but not everyone has had success implementing a good data governance strategy. According to Gartner, "each year, poor data quality costs businesses an average of $12.9 million." Beyond the immediate impact on revenue, poor-quality data increases the complexity of data ecosystems over the long term and leads to poor decision-making.
Data governance has two components: data catalog and data lineage.
According to Oracle, one of the world's largest database vendors, a data catalog is an "organized inventory of data assets" that uses metadata to help organizations manage their data. Data lineage, in contrast, describes how those data assets are created and connected to each other, forming an audit trail.
These concepts seem simple, so why exactly are organizations failing to manage their data, or managing it the wrong way?
Challenges companies face in improving data governance capabilities:
- It is not easy to keep a single source of truth.
Organizations can easily create duplicates of their data. When those duplicates are synchronized into the data catalog tool, users are left unsure which version of a table is the latest.
Keeping up with the ever-increasing amount of data is not easy. New ETL flows and processes complicate the management of data definitions. Most of the time, organizations lack a dedicated team to manage metadata.
- Developers on a project work in silos, without knowing how others use the data.
To create a data audit trail, one must understand how the data is used. Without code standardization or communication between developers, code analysis tools are useless and will not generate data lineage.
- Code evolves faster than documentation can keep up.
As data grows, code also grows exponentially, especially in a large team. If code develops without standardization or an automated means of extracting lineage that can adapt to different development styles, maintaining a lineage diagram becomes impossible.
When choosing data catalog tools, it is important to consider the following factors:
- The tool conforms to industry standards
- It does not require much developer intervention
- It can work with other existing tools
- It integrates tightly with code
Databricks is known for its excellence in data processing, and it has recently released new frameworks to make data governance simpler and more efficient. One example is Delta Live Tables (DLT), a framework for performing data cataloging and lineage within the Databricks ecosystem that is also compliant with industry standards.
Delta Live Tables is not just a data governance tool; it also supports many distinctive features, such as streaming tables, audit logs, and a QA and ETL framework, among others. In this article, however, we will focus on the data governance aspect of DLT.
Delta Live Tables supports SQL and Python. To take advantage of the framework, however, a specific syntax must be followed.
To create a Delta Live Table in SQL, the only thing that needs to change is to use the LIVE keyword as follows:
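A minimal sketch of the SQL form, with hypothetical table and column names (DLT code runs only inside a Databricks pipeline; references to other live tables use the `LIVE.` prefix):

```sql
-- Hypothetical names; the LIVE keyword registers this as a Delta Live Table.
CREATE OR REFRESH LIVE TABLE sales_orders_cleaned
COMMENT "Orders with a non-null revenue value."
AS SELECT order_id, revenue
   FROM LIVE.sales_orders_raw
   WHERE revenue IS NOT NULL
```

DLT parses the `FROM LIVE.sales_orders_raw` reference to build the dependency graph between tables, which is what later powers the automatic lineage diagram.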
Python’s syntax is more complex; however, by just adding a declaration, you can make it simpler:
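In Python, the declaration is the `@dlt.table` decorator: a function returns a DataFrame, and the decorator registers the result as a live table. A hedged sketch with hypothetical names (runnable only inside a Databricks DLT pipeline, where the `dlt` module and `spark` session are provided):

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical table name; @dlt.table registers the returned
# DataFrame as a managed Delta Live Table.
@dlt.table(comment="Orders with a non-null revenue value.")
def sales_orders_cleaned():
    # dlt.read() resolves another live table in the same pipeline,
    # which lets DLT track the dependency for lineage.
    return dlt.read("sales_orders_raw").where(col("revenue").isNotNull())
```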
So what’s the magic behind this LIVE keyword if we’re not streaming?
Delta Live Tables goes far beyond streaming.
Delta Live Table Features
- Continuous or Triggered Pipeline – Whether you are running a streaming job or just performing a one-time load, you can use Delta Live Tables.
- Validations – Set expectations directly in the table definition, so there’s no need to set up another validation pipeline. An example of an expectation is “revenue is greater than 0”. You can choose to delete or keep the records or even stop the pipeline.
- Data lineage – You no longer need a separate tool to generate a lineage diagram. Instead, you can migrate a SQL or Python notebook to the DLT format and take advantage of the lineage that comes with the data pipeline; the diagram is generated automatically, with no extra tool integration required.
- Development and production modes – Test your code without affecting production jobs.
- Enhanced autoscaling – An advanced scaling mechanism that automatically scales clusters up or shuts them down, saving you money.
- Logging and Monitoring – DLT comes with a logging and monitoring dashboard that allows you to track job status step by step without having to create other monitoring tools.
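The validation behaviors listed above (keep the records, drop them, or stop the pipeline) correspond to three decorators in DLT's Python API. A hedged sketch with hypothetical table and column names (runnable only inside a Databricks DLT pipeline):

```python
import dlt

@dlt.table(comment="Orders checked against data quality expectations.")
# Keep violating rows, but record the violation count in pipeline metrics.
@dlt.expect("valid_revenue", "revenue > 0")
# Silently drop rows that have no order id.
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
# Stop the pipeline update entirely if any timestamp is missing.
@dlt.expect_or_fail("valid_timestamp", "order_ts IS NOT NULL")
def sales_orders_validated():
    return dlt.read("sales_orders_raw")
```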
Let’s look at a retail pipeline developed by Databricks:
The example above highlights four features:
- Streaming Pipeline
- Data validation
- Data lineage
- Validation Dashboard
The syntax for creating a streaming pipeline is:
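A sketch in DLT SQL, assuming a hypothetical source path; the `STREAMING` keyword combined with Auto Loader's `cloud_files()` function incrementally ingests new files as they arrive:

```sql
-- Hypothetical path and table name; streams JSON files via Auto Loader.
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "Raw retail orders ingested from JSON files."
AS SELECT * FROM cloud_files("/data/retail/sales_orders", "json")
```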
This raw pipeline simply streams the JSON files from the specified location, showing how much easier it is to create a streaming pipeline with DLT.
The next step is to perform data cleansing. Traditional ETL requires separate steps for error handling and data validation; that logic ends up buried inside SQL queries, leaving other developers to decode its purpose. In DLT there is a declarative way to handle these records, called expectations. The syntax is as follows:
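In DLT SQL, an expectation is written as a named constraint in the table definition, with an `ON VIOLATION` clause choosing what happens to failing rows. A hedged sketch with hypothetical names:

```sql
-- Hypothetical table and column names; rows with revenue <= 0 are dropped,
-- and the violation count appears in the pipeline's quality metrics.
CREATE OR REFRESH LIVE TABLE sales_orders_cleaned (
  CONSTRAINT valid_revenue EXPECT (revenue > 0) ON VIOLATION DROP ROW
)
COMMENT "Orders that pass the revenue check."
AS SELECT * FROM LIVE.sales_orders_raw
```

Omitting the `ON VIOLATION` clause keeps failing rows while still recording the violations, and `ON VIOLATION FAIL UPDATE` stops the pipeline instead.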
The pipeline graph in DLT shows how data moves from one place to another, so there is no need to run the code through a separate analysis tool to generate these diagrams.
Each step automatically provides a summary of expectations and data quality checks, saving the time otherwise spent creating and maintaining additional tooling. Because validation results are surfaced automatically, less time is needed to review code just to confirm that data is being validated.
Databricks Delta Live Tables help companies improve their data governance capabilities
Data teams are constantly on the move. With Databricks Delta Live Tables, however, they can build streamlined, reliable data pipelines and quickly find and manage enterprise data assets across various clouds and data platforms. They can also simplify enterprise-wide governance of data assets, both structured and unstructured.
We illustrated how to use Databricks Delta Live Tables to solve data governance issues. For an in-depth technical analysis and a full list of the features available in Delta Live Tables, see the official Databricks documentation.