Databricks adds data governance and marketplace functionality

Data lakehouse provider Databricks on Tuesday launched a new data marketplace along with new data engineering features at its annual Data+AI Summit, alongside the open-sourcing of Delta Lake.

The new marketplace, which will be available in the coming months, will allow companies to share data and analytics assets such as tables, files, machine learning models, notebooks, and dashboards, the company said, adding that the data does not need to be moved or replicated out of cloud storage to be shared.
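Databricks has said this kind of in-place sharing is powered by its open Delta Sharing protocol. A minimal consumer-side sketch in Python using the open source delta-sharing client (the profile file path and table name are hypothetical, issued by a data provider):

```python
import delta_sharing  # pip install delta-sharing

# Profile file containing the provider's endpoint and credentials (hypothetical path)
profile = "/path/to/provider.share"

# Discover which tables the provider has shared with us
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table directly into pandas; nothing is copied into our own storage
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.orders")
print(df.head())
```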

The marketplace, the company says, will accelerate data engineering and app development because it allows companies to access an existing data set instead of developing one, and to subscribe to a dashboard for analysis instead of creating a new one.

Databricks Marketplace allows users to share and monetize data

Databricks said the marketplace will make it easier for companies sharing data assets to monetize them.

The new marketplace resembles Snowflake's data marketplace in design and strategy, analysts said.

“Every major enterprise platform (including Snowflake) must have a viable application ecosystem to truly be a platform and Databricks is no exception. It seeks to be a central marketplace for data assets and should be viewed as an immediate opportunity for ISVs and app developers looking to build on top of Delta Lake,” said Hyoun Park, chief analyst at Amalgam Insights.

Comparing the Databricks Marketplace to the Snowflake Marketplace, Doug Henschen, principal analyst at Constellation Research, said that in its current form the Databricks offering is very new and handles only data sharing, both internal and external, whereas Snowflake has added integrations and support for data monetization.

In an effort to promote secure data collaboration between companies, Databricks said it is introducing an environment, dubbed Cleanrooms, which will be available in the coming months.

A data cleanroom is a secure environment that allows a business to anonymize, process, and store personally identifiable information, making it available later for data processing in a way that does not violate data privacy rules.

Databricks Cleanrooms will provide a way to share and join data between companies without the need for replication, the company said, adding that these companies will be able to collaborate with customers and partners on any cloud, with the flexibility to run complex computations and workloads using both SQL and data science tools, including Python, R, and Scala.
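Databricks has not yet detailed the Cleanrooms interface, but the underlying idea can be sketched: each party pseudonymizes its join keys so raw identifiers never leave either side, and only aggregates come out. A purely illustrative PySpark sketch (table names, column names, and the salt are all hypothetical, not Databricks' actual Cleanrooms API):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical shared salt agreed on by both parties out of band
SALT = "shared-secret-salt"

# Replace the raw email with a salted hash before any data leaves our side
ours = (spark.table("our_customers")
        .withColumn("customer_key",
                    F.sha2(F.concat(F.col("email"), F.lit(SALT)), 256))
        .drop("email"))

# The partner prepares its side the same way; only hashed keys are joined
theirs = spark.table("partner_customers")
ours.join(theirs, "customer_key").groupBy("segment").count().show()
```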

The promise of complying with privacy standards is an attractive proposition, Park said, adding that the litmus test will be adoption in the financial services, government, legal, and healthcare sectors, which have strict regulatory guidelines.

Databricks updates data engineering and management tools

Databricks has also launched several additions to its data engineering tools.

One of the new tools, Enzyme, is a new optimization layer designed to speed up the extract, transform, load (ETL) process in Delta Live Tables, which the company made generally available in April of this year.

“The optimization layer is focused on supporting automated incremental data integration pipelines using Delta Live Tables through a combination of query planning and analysis of data change requirements,” said Matt Aslett, research director at Ventana Research.
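Enzyme applies to pipelines declared with Delta Live Tables, which can be written in SQL or Python. A minimal Python sketch of such a pipeline (the source path and table names are illustrative):

```python
import dlt  # available inside a Delta Live Tables pipeline
from pyspark.sql import functions as F

@dlt.table(comment="Orders ingested incrementally from cloud storage")
def orders_raw():
    # Auto Loader incrementally picks up new files; the path is illustrative
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/data/orders"))

@dlt.table(comment="Cleaned orders, kept up to date incrementally")
def orders_clean():
    return dlt.read_stream("orders_raw").where(F.col("amount") > 0)
```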

This layer, according to Henschen, should “check off another set of capabilities expected by customers that will make it more competitive as an alternative to conventional data warehouse and data mart platforms.”

Databricks also announced Project Lightspeed, the next generation of Spark Structured Streaming for its lakehouse platform, which it says will reduce costs and latency while using an expanded ecosystem of connectors.
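Project Lightspeed is pitched as an evolution of the existing Spark Structured Streaming engine rather than a new API. For reference, a minimal PySpark sketch of that API, using a synthetic rate source so it runs anywhere:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Synthetic source emitting (timestamp, value) rows; stands in for Kafka, Kinesis, etc.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (events.withColumnRenamed("value", "event_id")
         .writeStream
         .format("console")                    # sink is illustrative
         .outputMode("append")
         .trigger(processingTime="5 seconds")  # micro-batch cadence
         .start())

query.awaitTermination(30)  # run briefly for the demo
query.stop()
```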

Databricks describes Delta Lake as the foundation of a data lakehouse, an architecture that provides both storage and analytics capabilities, in contrast to data lakes, which store data in native format, and data warehouses, which store structured data (often in SQL format) for fast querying.
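Concretely, a Delta table is ordinary cloud storage plus a transaction log, written and read through Spark. A minimal sketch, assuming a Spark session with the Delta Lake libraries attached (the path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta Lake jars assumed present

# Write a small DataFrame as a Delta table, then read it back
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

spark.read.format("delta").load("/tmp/delta/events").show()
```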

“Streaming data is an area where Databricks differentiates itself from some of the other data lakehouse providers and is gaining more attention as real-time data-driven applications and streaming events become more common,” Aslett said.

The next iteration of Spark Structured Streaming, according to Park, shows Databricks’ growing interest in supporting streaming data sources for analytics and machine learning.

“Machine learning is no longer just a tool for massive big data, but a valuable feedback and alerting mechanism for real-time and distributed data,” the analyst said.

Additionally, to help enterprises with data governance, the company has launched Data Lineage for Unity Catalog, which will be generally available on AWS and Azure in the coming weeks.

“The general availability of Unity Catalog will help improve the security and governance aspects of lakehouse assets, such as files, tables, and ML models. This is essential to protect sensitive data,” said Sanjeev Mohan, former research vice president for big data and analytics at Gartner.
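Unity Catalog governs assets through a three-level namespace (catalog.schema.table) and standard SQL grants. A minimal sketch from a notebook attached to a Unity Catalog-enabled workspace (catalog, schema, table, and group names are hypothetical):

```python
# Grant a group read access to one table, then inspect the resulting grants
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")
spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()
```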

The company also launched Databricks SQL Serverless on AWS, a fully managed service that maintains, configures, and scales cloud infrastructure for the lakehouse.

Some of the other updates include query federation functionality for Databricks SQL and a new SQL CLI that lets users run queries directly from their local computers.

The federation functionality allows developers and data scientists to query remote data sources, including PostgreSQL, MySQL, and AWS Redshift, among others, without first having to extract and load the data from the source systems, the company said.
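Under the federation model, a remote source is registered once and then queried in place. A hedged sketch for PostgreSQL, following the syntax Databricks announced (host, database, secret scope, and table names are all hypothetical):

```python
# Register a remote PostgreSQL table without copying its data
spark.sql("""
    CREATE TABLE IF NOT EXISTS postgres_orders
    USING postgresql
    OPTIONS (
      dbtable 'orders',
      host 'db.example.com',
      port '5432',
      database 'shop',
      user secret('pg_scope', 'username'),
      password secret('pg_scope', 'password')
    )
""")

# The query runs against the live source system
spark.sql("SELECT count(*) FROM postgres_orders").show()
```

The same statements can also be issued from a local terminal against a Databricks SQL endpoint via the new SQL CLI.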

Copyright © 2022 IDG Communications, Inc.