This article was originally posted on LinkedIn
Data ingestion is one of the first and most critical hurdles to overcome when architecting a data platform, but it’s not exactly the most exciting. I’ve been meaning to dig into Lakeflow Connect, Databricks’ new built-in data ingestion tooling, for a while now. Here’s what I like about it:
Speed to ingestion ➡️ You can start with one of the fully managed connectors to get data in fast (the SQL Server connector just went GA) - https://lnkd.in/eq_Mfr8H
It’s efficient ➡️ It leans on Auto Loader to make incremental reads and writes easy, meaning you only pull what’s new, not the whole dataset every time (see the sketch after this list).
Unity Catalog ➡️ This is by far the biggest benefit of Lakeflow Connect. Integration with Unity Catalog and Lakeflow Jobs brings ingestion, orchestration, and lineage under one roof.
Flexible operating model ➡️ You can pick managed connectors, use declarative pipelines, or drop down to Structured Streaming when you want to get your hands dirty (a declarative example follows below).
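
To make the incremental point concrete, here’s a minimal Auto Loader sketch. The catalog, volume paths, and file format are my own assumptions for illustration, not anything from Lakeflow Connect’s docs; the point is that Auto Loader tracks which files it has already processed, so each run only pulls what’s new.

```python
# Minimal Auto Loader sketch. Paths, catalog/schema names, and the JSON format
# are hypothetical assumptions for this example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_orders = (
    spark.readStream
    .format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")       # assumed source file format
    .option("cloudFiles.schemaLocation", "/Volumes/demo/ingest/_schemas/orders")
    .load("/Volumes/demo/ingest/landing/orders")
)

(
    raw_orders.writeStream
    .option("checkpointLocation", "/Volumes/demo/ingest/_checkpoints/orders")
    .trigger(availableNow=True)                # process only new files, then stop
    .toTable("demo.bronze.orders")
)
```

Run it on a schedule and each trigger picks up only the files that landed since the last checkpoint, which is exactly the “only pull what’s new” behaviour the managed connectors build on.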
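
And if you go the declarative pipelines route instead, a couple of table definitions look roughly like this. Again, a hedged sketch: the table names, columns, and paths are made up, and it uses the familiar dlt decorator API (Lakeflow Declarative Pipelines, formerly Delta Live Tables).

```python
# Rough sketch of a declarative pipeline: bronze ingest with Auto Loader,
# silver cleanup downstream. Names, columns, and paths are hypothetical.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Bronze orders ingested incrementally with Auto Loader")
def bronze_orders():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/demo/ingest/landing/orders")
    )


@dlt.table(comment="Silver orders: typed and lightly cleaned")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .filter(F.col("order_id").isNotNull())
    )
```

The nice part of the declarative flavour is that orchestration, retries, and lineage come from the pipeline itself, so the code stays focused on the transformations.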
In summary: Lakeflow Connect fills a gap that lifts Databricks from a technical compute platform to a strategic enabler for businesses. It’s a step in the right direction, and a well-executed one at that.
There are obviously more mature tools for data ingestion, such as Azure Data Factory, Airflow, Fivetran, and Kafka. Those are all great in their own way (mostly), but each one is another tool to incorporate into your platform, another cost to monitor, and another integration to manage. Having ingestion built into Databricks removes that overhead, but be aware that it’s going to take time for the list of connectors to build up and come out of preview. You might be rolling your own custom pipelines for edge cases for a while.
Check it out and let me know what you think in the comments: https://lnkd.in/ejs_nxMK



