Data lakes and warehouses part 6: Microsoft Fabric as a data lakehouse technology

Since its launch in 2023, Microsoft Fabric has gained traction as a unified and open platform for analytics. Data architects Timo Aho and Jari Lahdenperä explain how it works and what it is good for.

Timo Aho / August 16, 2024

Back in 2021, we anticipated that the future would be even brighter for hybrid solutions.

This seems to have come true: Databricks and Microsoft in particular are developing technology based on the hybrid data lakehouse architecture. The new Microsoft Fabric is a heavy investment in this direction.

Previously in this blog, we have explored such technologies as Snowflake, Databricks and Azure Synapse Analytics, and introduced good implementation practices for both paradigms, data lakes and data warehouses.

In this post, we dig deeper into the Microsoft Fabric environment. Microsoft Fabric is based on a hybrid of the data warehouse and data lake architectures: the data lakehouse. Note: Microsoft Fabric should not be confused with Azure Service Fabric, which is an unrelated technology.

Previous posts in this series are:

Part 1: Intro to paradigms

Part 2: Databricks and Snowflake

Part 3: Azure Synapse point of view

Part 4: Challenges

Part 5: Hands on solutions

Data Lakehouse architecture

In marketing terms, the data lakehouse architecture is a way to get the best parts of data warehouses and data lakes.

The data is stored in an open environment in open file formats. Thus the data is, at least in theory, accessible with external tools, as is typical for a data lake architecture. This enables the separation of the storage and computation layers, which allows great elasticity in development.

On the other hand, if the data is accessed from the designated framework, it is organized in a database-like manner, with schemas, tables, views and the related permission management.

The data lakehouse is usually coupled with the medallion architecture, which divides data into increasingly processed layers, typically called bronze, silver and gold, as described in our previous post.
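As a rough illustration of the layering idea, the following minimal Python sketch moves a handful of records through the three layers. The sample records, field names and cleaning rules are hypothetical. In a real lakehouse each layer would typically be a set of tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "fi", "amount": "10.50"},
    {"order_id": "1", "country": "fi", "amount": "10.50"},  # duplicate row
    {"order_id": "2", "country": "SE", "amount": "n/a"},    # unparsable amount
    {"order_id": "3", "country": "FI", "amount": "4.50"},
]


def to_silver(rows):
    """Clean bronze rows: cast types, normalize values, drop duplicates and bad rows."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip orders that were already processed
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "country": row["country"].upper(), "amount": amount}
        )
    return silver


def to_gold(rows):
    """Aggregate silver rows into a reporting-ready total per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw bronze data untouched means the silver and gold layers can be rebuilt whenever the cleaning or aggregation rules change.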

Microsoft Fabric in general

Fabric is a novel end-to-end cloud data platform. It is a replacement for the Azure Synapse framework and extends its functionality. Microsoft offers clear instructions for migrating between the environments. On the other hand, Fabric can be considered a significant extension of Power BI.

Like Azure Synapse, the Microsoft Fabric environment consists of multiple Azure components, both new and existing. The number of components is so vast, and still increasing, that we only introduce some of them. The foundation of Microsoft Fabric is the concept of OneLake. It is a way to use both third-party tools and Microsoft storage solutions in a database-like way. By default, OneLake stores data in the Delta (Lake) file format, but it also enables connections to databases and third-party sources.

In addition to OneLake, the other components include:

Power BI for data analysis and visualization
Data Factory for batch data processing
Two types of data warehousing technologies:
- SQL Analytics endpoint – an on-demand serverless read-only SQL query engine. In the current free trial version, serverless can mean startup times of 10-30 seconds.
- Synapse Data Warehouse for more traditional data warehousing
Synapse Data Engineering – a Spark based environment for notebooks and jobs

Diagram of the software as a service foundation beneath the different experiences of Fabric.

Microsoft Fabric combines both new and existing components for managing different workloads. (Source of the image: What is Microsoft Fabric - Microsoft Fabric | Microsoft Learn)

Microsoft Fabric currently also includes a novel technology for no-code development (Data Activator), machine learning features (Synapse Data Science), fast data model sketching (Industry Solutions) and real-time analysis (Real-Time Intelligence).

The basic philosophy of Microsoft Fabric could be described as taking Power BI and making it a full-blown end-to-end data platform: it includes tools for everything from data preparation to reporting and visualization. The integration with Power BI is strong, which makes it easy to use Fabric data from Power BI. Based on our experience, the Power BI origin can be seen not only in the user interface but also in the licensing model.

Fabric has pay-as-you-go costs for storage and processing. These expenses might be tricky to estimate in advance. A central concept is the somewhat mysterious Capacity Unit (CU), a measure of compute power that is sold under two Stock Keeping Unit (SKU) types: for Power BI related processing you pay for Power BI capacity (P SKU), and for other processing for Fabric capacity (F SKU). Based on our initial projects, the pay-as-you-go expenses do not seem to be high. You can test everything with a Fabric trial subscription, but the supported features are limited: for example, Microsoft Copilot is not available.
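To make the arithmetic behind a CU-based bill more concrete, here is a back-of-the-envelope sketch in Python. It assumes that compute is billed per CU per hour and only for the hours the capacity is running. The function name and the price are hypothetical placeholders, not Microsoft list prices, and storage costs are left out.

```python
def monthly_capacity_cost(capacity_units, price_per_cu_hour, hours_running):
    """Estimate a compute bill as CUs x hourly price per CU x hours the capacity runs."""
    return capacity_units * price_per_cu_hour * hours_running


# Placeholder value for illustration only, NOT a real price.
EXAMPLE_PRICE_PER_CU_HOUR = 0.20

# A capacity of 2 CUs running 8 hours a day on 22 working days.
estimate = monthly_capacity_cost(2, EXAMPLE_PRICE_PER_CU_HOUR, 8 * 22)
```

With real prices filled in, the same calculation can be used to compare capacity sizes and running hours before committing to one.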

Microsoft Fabric as a Data Lakehouse solution

With Fabric, Microsoft aims to provide a single experience and take away some of the complexity associated with data platform environments. Building the first version of your data platform is simple in Microsoft Fabric – especially if you are a Microsoft Windows user. You can drag and drop data files into the Fabric UI in your web browser, and a lakehouse-style OneLake table is created in the default, efficient Delta file format. Adding external data sources like a Windows network folder is a few clicks away. Third-party sources like AWS S3, Snowflake and Databricks are also supported. You can create a full-fledged medallion architecture sketch fast.
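Because the tables live in OneLake as Delta files, they can in principle also be addressed from outside the Fabric UI. The sketch below builds an ABFS-style address for a lakehouse table. The URI layout and host name are our understanding of the OneLake convention and should be checked against the current documentation. The helper and the workspace, lakehouse and table names are hypothetical.

```python
# Assumed OneLake endpoint; verify against the current Microsoft documentation.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_table_uri(workspace, lakehouse, table):
    """Build an ABFS-style URI for a lakehouse table stored in OneLake."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/Tables/{table}"


# Hypothetical workspace, lakehouse and table names.
uri = onelake_table_uri("sales_workspace", "sales_lakehouse", "orders")
```

A Delta-aware external tool could then be pointed at this address, provided it authenticates with an identity that has access to the workspace.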

All the different tools are separated under categories like “Data Factory” and “Data Engineering”. The number of tools and categories is one of the drawbacks of the architecture: at first, you often get lost in the vast number of tools. You might spend a significant amount of time finding the browser view you used just a few hours ago.

The Power BI background can also be seen in the lack of support for DevOps processes and technologies. For example, currently no Infrastructure-as-Code (IaC) tools, not even the default Azure Resource Manager (ARM) templates, are supported for most of the Microsoft Fabric components. If you plan to have completely separate production and development Fabric environments, you might need a bit of tuning. However, Fabric does have its own lifecycle management tools and Git integration.

One more limitation of Fabric is the lack of fine-grained security and authorization features: only a small number of RBAC roles are currently supported.

Comparison to Databricks

Databricks is a somewhat similar solution and probably served as a reference for Microsoft when designing Fabric. It is important to note that Databricks is still a first-party service on Microsoft Azure. Both Databricks, the company, and Microsoft are eager to stress that the tools are very well integrated, and that the companies are more partners than competitors.

In comparison with Fabric, Databricks is more like a Swiss army knife. It provides more flexibility but requires a lot of work from a beginner to get started. For example, creating an empty Unity Catalog-based data lakehouse might require the following steps:

enable Unity Catalog for your workspace
create an Azure Storage account for your Unity Catalog
create a Unity Catalog storage credential using an Azure service principal with the correct permissions
create a Unity Catalog external location pointing to the Azure Storage using the storage credential
create the actual Unity Catalog catalogs and schemas
create a Unity Catalog-enabled compute cluster for the users
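To give a feel for the later steps, the Python sketch below generates SQL statements for the external location, the catalog and its schemas, assuming that the workspace, the storage account and the storage credential from the earlier steps already exist. All object names and the storage URL are hypothetical, and the statements would still need to be run in a Databricks SQL editor or notebook by a user with sufficient privileges.

```python
def unity_catalog_setup_sql(location, credential, storage_url, catalog, schemas):
    """Return SQL statements for the external location, the catalog and its schemas."""
    statements = [
        # External location that binds the storage URL to an existing credential.
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location} "
        f"URL '{storage_url}' WITH (STORAGE CREDENTIAL {credential})",
        # Catalog whose managed tables are stored under the same URL.
        f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{storage_url}'",
    ]
    # One schema per requested name inside the new catalog.
    statements += [f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}" for schema in schemas]
    return statements


# Hypothetical names throughout.
statements = unity_catalog_setup_sql(
    location="lakehouse_location",
    credential="lakehouse_credential",
    storage_url="abfss://lakehouse@examplestorage.dfs.core.windows.net/",
    catalog="demo",
    schemas=["bronze", "silver", "gold"],
)
```

The names are interpolated into the statements as-is, so the helper is only suitable for trusted, hand-written input.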

Databricks is a multi-cloud solution that was originally designed for data engineering with Scala, Python and SQL. Its dashboard support is still far behind Microsoft’s Power BI, even if the features for analytics users are improving significantly.

Even though Apache Spark, an open source tool, is used behind the scenes, it is well hidden from users who do not want to know about it: serverless SQL warehouses, graphical lineage graphs and database exploration are more than enough. Moreover, SQL can be used to do nearly everything, such as defining a full streaming data ingestion platform with Delta Live Tables.

Summary

Microsoft Fabric is a data lakehouse technology that offers a database-like user experience, bringing together tools for data analysis, data processing, and data warehousing, among others. Even if Fabric is a fresh tool, it has tight integration with Power BI, which is a mature technology in its own right. The Fabric component OneLake is used to store data in Delta (Lake) files, enabling the separation of the storage and computation layers.

In comparison with Databricks, a similar solution, we note that Fabric focuses more on offering a user-friendly end-to-end data platform. Databricks requires more effort from beginners. Whereas Databricks offers a full-fledged developer experience, Fabric is graphically oriented and has very limited support for such things as Infrastructure-as-Code (IaC).

Overall, Microsoft Fabric provides a comprehensive data platform, covering everything from data preparation to reporting and visualization. It offers a way to use both data lake and database concepts, giving users a wide range of tools and capabilities. It seems that Microsoft's focus has shifted from the Synapse framework to Fabric, so the latter will probably be under strong development for a while.

Jari Lahdenperä

Lead Data Architect, Tietoevry Create

Jari is a hands-on data and cloud professional. His expertise is in designing globally available and secure cloud data solutions in Azure and AWS. Lately, he has focused on lakehouse architectures in Databricks and Microsoft Fabric.

Timo Aho

Cloud Data Expert, Tietoevry Create

Timo is a cloud data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies for storing, organizing and querying data efficiently in cloud environments. He has worked in data roles both as a consultant and in-house.