Databricks Introduces LakeFlow: Empowering Clients to Develop Seamless Data Pipelines

With LakeFlow, Databricks users will soon be able to build data pipelines and ingest data from databases such as MySQL, Postgres, SQL Server, and Oracle, as well as enterprise applications like Salesforce, Dynamics, SharePoint, Workday, NetSuite, and Google Analytics. In a way, getting data into a data warehouse or data lake should be table stakes: the real value creation happens further down the line.

Databricks, a leading data and AI company founded in 2013, has long relied on its extensive partner network to provide tools for data preparation and loading. At its annual Data + AI Summit, however, the company announced LakeFlow, its own data engineering solution. The new tool handles data ingestion, transformation, and orchestration, reducing the need for third-party solutions. The move marks a significant shift in Databricks’ strategy: the company is now entering a space that was previously filled by its partners.

“Everybody in the audience said: we just want to be able to get data in from all these SaaS applications and databases into Databricks,” said Ali Ghodsi, co-founder and CEO of Databricks.

LakeFlow lets Databricks users build their data pipelines and ingest data from databases such as MySQL, Postgres, SQL Server, and Oracle, as well as enterprise applications like Salesforce, Dynamics, SharePoint, Workday, NetSuite, and Google Analytics. The move was shaped by feedback from Databricks’ advisory board at the Databricks CIO Forum, where customers put a heavy emphasis on data ingestion.

“I literally told them: we have great partners for that. Why should we do this redundant work? You can already get that in the industry,” Ghodsi recalled.

While building connectors and data pipelines may seem like a commoditized business, it became apparent that a large percentage of Databricks customers were not utilizing the company’s ecosystem partners. Instead, they were developing their own custom solutions to cater to their specific edge cases and security requirements. This realization prompted Databricks to explore possibilities in this space, ultimately leading to the acquisition of the real-time data replication service Arcion in November of last year.

Databricks plans to “continue to double down” on its partner ecosystem, but there is clearly market demand for a service like LakeFlow that is integrated directly into the platform. As Ghodsi puts it, “This is one of those problems they just don’t want to have to deal with. They don’t want to buy another thing. They don’t want to configure another thing. They just want that data to be in Databricks.”

The promise of LakeFlow is an end-to-end solution for enterprises: take data from a wide range of systems, transform and ingest it in near real time, and build production-ready applications on top of it. The core of the LakeFlow system consists of three parts: LakeFlow Connect, LakeFlow Pipelines, and LakeFlow Jobs.

  • The first part, LakeFlow Connect, provides the connectors between the various data sources and the Databricks service. It is fully integrated with Databricks’ Unity Catalog data governance solution, relies in part on technology from Arcion, and is designed to scale to very large workloads. Supported data sources currently include SQL Server, Salesforce, Workday, ServiceNow, and Google Analytics, with MySQL and Postgres coming soon.
  • The second part, LakeFlow Pipelines, is an evolution of Databricks’ Delta Live Tables framework and lets users implement data transformations and ETL in either SQL or Python (a minimal sketch follows this list). It offers a low-latency mode for data delivery and supports incremental processing, so only changes to the source data need to be synced.
  • The third part, LakeFlow Jobs, is the engine for automated orchestration and for ensuring data health and delivery. Jobs orchestrates other actions in Databricks, such as updating dashboards or training machine learning models on the ingested data (also sketched below).
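
To make the pipeline layer a bit more concrete, here is a minimal sketch written against the existing Delta Live Tables Python API that LakeFlow Pipelines evolves from. Databricks has not published LakeFlow-specific code, so the table names, storage path, and data-quality rule below are purely illustrative.

```python
# Minimal Delta Live Tables-style pipeline sketch (the framework LakeFlow
# Pipelines evolves from). Paths and table names below are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders landed by an ingestion connector")
def raw_orders():
    # Auto Loader incrementally picks up new files; `spark` is provided
    # by the pipeline runtime.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/raw_orders")  # hypothetical location
    )

@dlt.table(comment="Cleaned orders ready for analytics")
@dlt.expect_or_drop("positive_amount", "amount > 0")  # data-quality rule
def cleaned_orders():
    # Reads the upstream table as a stream, so only new or changed
    # records are processed on each update.
    return dlt.read_stream("raw_orders").where(col("status").isNotNull())
```

In the same spirit, here is a hedged sketch of what the orchestration layer looks like today with the Databricks Jobs API via the Python SDK, which LakeFlow Jobs builds on; the job name, pipeline ID, and notebook path are placeholders.

```python
# Hypothetical orchestration sketch using the existing Databricks Jobs API
# through the Python SDK; names, IDs, and paths are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads workspace credentials from the environment

job = w.jobs.create(
    name="orders-refresh",
    tasks=[
        # Step 1: run the ingestion/transformation pipeline.
        jobs.Task(
            task_key="refresh_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<pipeline-id>"),
        ),
        # Step 2: once the pipeline succeeds, refresh a dashboard notebook.
        # (Compute settings are omitted for brevity; a real job would
        # specify serverless or a cluster.)
        jobs.Task(
            task_key="update_dashboard",
            depends_on=[jobs.TaskDependency(task_key="refresh_pipeline")],
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Shared/update_dashboard"
            ),
        ),
    ],
)
print(f"Created job {job.job_id}")
```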

Ghodsi also acknowledged that many Databricks customers are looking to cut costs and consolidate the services they pay for, a trend that has become increasingly common over the past year. An integrated service for data ingestion and transformation fits squarely into that trend and offers clear value to enterprises.

Databricks is rolling out the LakeFlow service in phases, starting with LakeFlow Connect, which will be available as a preview in the near future; a waitlist for sign-ups is already open.
