Understanding OneLake and lakehouses in Microsoft Fabric
Tracking the annual flurry of announcements at Microsoft Build is a good way to understand what the company thinks is important for its developer customers. Build 2023 pushed artificial intelligence and machine learning to the top of that list, with Microsoft unveiling a full-stack approach to building AI applications, starting with your data and building on up.
Among the biggest news for that AI stack was the launch of Microsoft Fabric, a software-as-a-service set of tools for working with big data, with a focus on data science and data engineering. After all, building custom AI applications begins with identifying and providing the data needed to design and train machine learning models. But Fabric is also concerned with running those applications, delivering the real-time analytics needed to run a modern business.
Microsoft Fabric: A one-stop data shop
The intended audience of Microsoft Fabric covers both business users and developers, so there’s a lot to discover. Much of what’s in Fabric exists already in Microsoft Azure and the Power Platform. The key changes are a focus on open data formats and providing a single portal for working with data that can support many different use cases.
With Fabric, Microsoft is bringing together many of the key elements of its data analytics stack, filling in gaps, and wrapping it all in a single software-as-a-service dashboard. Here you’ll find elements from the Azure data platform alongside tools from the Power Platform, combined to give you a single source of truth for your enterprise data, whatever its source.
That last point is perhaps the most important. With data produced and used by many different applications, we need a common place to access and use that data, no matter how it’s stored. Fabric lets us mix structured and semi-structured data, and use relational and NoSQL stores to gain the insights we need. It’s an end-to-end enterprise data platform that can bring in data from the edge of our networks, and deliver the information people need to enterprise dashboards. At the same time, Fabric can provide the training data for our machine learning models.
The result is a single data platform that offers different user experiences for different purposes. If you’re using Fabric for analysis, you can explore data using Power Query in Power BI. If you’re looking for insights in operational data, you can use Apache Spark and Python notebooks, while machine learning developers can work with data using the open source MLflow platform.
OneLake: the OneDrive for data
Microsoft Fabric is built on top of a single data platform, OneLake. Described as “OneDrive for data,” OneLake is an organization-scale data lake for all of your analytics data. That’s an important difference from other data lake products, as it takes you away from previous siloed approaches, where individual departments manage their own data lakes. All your data goes into OneLake, allowing you to provision separate data warehouses and lakehouses, in workspaces that can have centrally managed policies and security tools to ensure that data is not used inappropriately.
OneLake is built on Azure Data Lake Storage Gen2, Azure’s second-generation data lake service. There’s only one OneLake per tenant, with data stored in multiple containers. Each OneLake can be subdivided into many workspaces, each with its own access policies and its own data items. OneLake is designed to host any type of file, with both web-based and desktop tools to help you explore and use your data.
You’re not limited to Azure data. Microsoft’s existing library of connectors to line-of-business applications and services ensures that you can use Fabric’s data factory tools to manage data from multiple sources. One key feature here is support for the Apache Parquet data format. Designed for large data warehouses, Parquet is a column-oriented data storage format that’s easily compressed and memory efficient, with support for high-performance column queries. Because data can be exported in Parquet format from most cloud storage services using Fabric data factory connectors, Parquet gives you a way to optimize data exports for use in Fabric’s data lake.
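To see why a column-oriented layout suits column queries and compression, here is a minimal pure-Python sketch. It is a toy model, not how Parquet is implemented: real Parquet files add row groups, encodings, compression codecs, and statistics. The sample `rows` and the `to_columns` helper are invented for illustration.

```python
# Illustrative only: a toy comparison of row and column layouts.
# The data and helper names are hypothetical.

rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 75.5},
    {"order_id": 3, "region": "EMEA", "amount": 210.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A column query reads one list instead of visiting every field of every row.
total_amount = sum(columns["amount"])

# Values of one type sit together, so simple schemes such as dictionary
# encoding can replace repeated strings with small integer codes.
region_dictionary = sorted(set(columns["region"]))
region_codes = [region_dictionary.index(r) for r in columns["region"]]

print(total_amount)
print(region_dictionary, region_codes)
```

Summing `amount` never touches `order_id` or `region`, which is the property that makes columnar formats a good fit for analytical queries over wide tables.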
OneLake stores tables in the Delta format, which builds on Apache Parquet by adding support for transactions and scalable metadata. It’s an open format that can support many different types of data sources. Delta tables are designed for large data lakes, much like Fabric’s, and offer a range of APIs that make it easier to integrate with traditional analytics and machine learning tools. Using OneLake means you only need to store the data once, and you can use it with your choice of query tool.
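The transactional side of Delta can be pictured as an ordered log of commits layered over immutable Parquet files, where each commit adds or removes data files. The sketch below is a simplified model of that idea, not the Delta Lake library or its exact log schema: the file names are hypothetical, and real commits are JSON files in the table’s `_delta_log` directory with richer action records.

```python
# Toy model of a Delta-style transaction log. File names are hypothetical,
# and the action shape is simplified for illustration.

commits = [
    [{"add": "part-0001.parquet"}],
    [{"add": "part-0002.parquet"}],
    # A rewrite (for example, compaction) removes old files and adds a new one
    # in a single atomic commit.
    [
        {"remove": "part-0001.parquet"},
        {"remove": "part-0002.parquet"},
        {"add": "part-0003.parquet"},
    ],
]


def live_files(log, version=None):
    """Replay commits 0..version and return the data files in that snapshot."""
    if version is None:
        version = len(log) - 1
    files = set()
    for commit in log[: version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return sorted(files)


print(live_files(commits))     # latest snapshot
print(live_files(commits, 1))  # the table as of version 1
```

Because readers compute a snapshot by replaying the log, a query sees either all of a commit or none of it, and older versions remain readable for as long as their files are retained.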
OneLake and data lakehouses
One key concept is critical to all the different use cases for Fabric: the lakehouse. A lakehouse helps you bring the data you need to one place, where it is accessible across the whole of your organization’s Azure-hosted data lake. A lakehouse gives you a way to use large amounts of data, while providing a single view that contains tools for storing, managing, and analyzing your data.
Fabric’s lakehouse implementation is designed to work with Delta tables, so you’ll need to ensure that any data in a lakehouse is in the appropriate format. Once data has been imported, you can use notebooks to explore it, using code to extract information that can be used elsewhere in your organization. Alternatively, you can use a SQL endpoint to access lakehouse data from other applications. OneLake also works with tools like Azure Databricks and Azure HDInsight through the existing Azure Data Lake Storage Gen2 APIs.
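Because OneLake speaks the Azure Data Lake Storage Gen2 APIs, external tools address lakehouse data with ABFS-style URIs. The helper below is a small sketch of how such a URI might be assembled. The endpoint name and the `<workspace>@host/<item>.Lakehouse/<path>` layout reflect the pattern in Microsoft’s OneLake documentation as best I understand it, so treat both as assumptions to verify. The workspace and lakehouse names are made up.

```python
# Assumed OneLake endpoint and URI layout; confirm against current Fabric docs.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_abfss_path(workspace, lakehouse, *parts):
    """Build an ABFS-style URI for a path inside a lakehouse item."""
    item = f"{lakehouse}.Lakehouse"  # item name plus its type suffix
    relative = "/".join([item, *parts])
    return f"abfss://{workspace}@{ONELAKE_HOST}/{relative}"


# Hypothetical workspace ("Analytics") and lakehouse ("Sales").
print(onelake_abfss_path("Analytics", "Sales", "Tables", "orders"))
```

A URI built this way can be handed to any client that already understands ADLS Gen2 paths, which is the point of reusing the existing APIs rather than introducing a new protocol.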
Creating a lakehouse is easy enough. You can start in the dashboard or inside an existing Fabric workspace. Once created, it’s ready for you to load data, with several different mechanisms available depending on your data source. While the simplest option is to upload data directly from a PC, it’s more practical to work with the built-in copy tool, which will convert data into Delta tables, ready for use. You can even use Power BI’s familiar dataflow tool to bring in data from connectors to other platforms and to handle the appropriate transforms. Alternatively, you can use Apache Spark code to load data into your lakehouses.
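As a rough idea of what the Spark route can look like, here is a minimal PySpark-style sketch that reads a CSV file and saves it as a Delta table. It assumes a notebook that already provides a `spark` session attached to a lakehouse. The function name, file path, and table name are hypothetical, and the read options are only one reasonable choice.

```python
def load_csv_as_delta_table(spark, csv_path, table_name):
    """Read a CSV file and save it as a Delta table (sketch, not Fabric-specific)."""
    df = (
        spark.read
        .option("header", "true")       # first row holds column names
        .option("inferSchema", "true")  # let Spark guess column types
        .csv(csv_path)
    )
    # Overwrite replaces any existing table of the same name.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return df


# In a notebook with a session available (path and name are made up):
# load_csv_as_delta_table(spark, "Files/raw/orders.csv", "orders")
```

Passing the session in as a parameter keeps the function easy to reuse across notebooks and to exercise with a stand-in session during testing.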
Real-time analytics in Fabric support time-based data in semi-structured formats. Instead of having separate tooling for long-term analysis and operational analysis, you can now work with the same data in different ways. As data arrives, operational analytics can help pinpoint issues that need immediate responses. Once stored, the same data becomes the basis of training data for machine learning as well as source data for report-based data analysis, along with data from other systems.
Dipping in and out of OneLake
Usefully, not all of your source data needs to be stored in OneLake; you can use shortcuts to link to other storage locations. Shortcuts are the data lake equivalent of a symbolic link, allowing you to work with data without hosting it in Azure. This reduces the risks associated with copying data, allowing you to control access to line-of-business systems from inside the Fabric dashboard. Once created, shortcuts are displayed as folders: a table folder for structured data and a file folder for unstructured data. If a shortcut contains either Delta or Parquet format data, it will automatically be treated as a table, with Fabric loading the connection’s metadata and using it to manage the resulting table.
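The symbolic-link analogy can be made concrete with a few lines of Python. This sketch is a conceptual model only and does not use any Fabric API: the `shortcuts` mapping, the external locations, and the `resolve` helper are all invented for illustration.

```python
# Hypothetical shortcut table: OneLake folder -> external storage location.
shortcuts = {
    "Tables/orders": "s3://example-bucket/warehouse/orders",
    "Files/logs": "abfss://raw@exampleaccount.dfs.core.windows.net/logs",
}


def resolve(path, links):
    """Return the physical location for a path, following at most one shortcut."""
    # Check the longest prefixes first so the most specific shortcut wins.
    for prefix in sorted(links, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return links[prefix] + path[len(prefix):]
    return path  # not under a shortcut: the data lives in OneLake itself


print(resolve("Files/logs/2023/05/app.log", shortcuts))
print(resolve("Tables/customers", shortcuts))
```

Callers keep using the same logical path whether or not a shortcut sits behind it, which is why the data can stay where it is instead of being copied.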
More and more enterprises are embracing a common repository for all of their data, and Microsoft is rushing to meet the demand with Fabric. By building on top of open standards like Delta and Parquet, Microsoft has found a way to help businesses build and manage data lakes using existing data platform skills—ready to support both data warehouse analytics and machine learning. Having a free trial while the service is in public preview makes it possible to evaluate it before making any long-term decision.