OneLake: The Core Component of Microsoft Fabric

Introduction to OneLake
OneLake is a single data lake for the entire organization in Microsoft Fabric. It is a SaaS service that comes automatically with a Microsoft Fabric tenant, so there is no need to create or purchase OneLake separately. Each Fabric tenant has exactly one OneLake instance. A data lake holds large amounts of data from a variety of sources.

OneLake is a logical storage solution; the physical storage backend is Azure Data Lake Storage Gen2 (ADLS Gen2). Tabular data in OneLake is stored in a single format: Delta. OneLake supports both structured and unstructured file formats. A single organization can use OneLake to store all of its data, which can then be analyzed with a variety of analytics tools. Before OneLake, organizations faced the challenge of managing multiple data storage resources; with OneLake, they manage a single resource for all of their data.
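
Because OneLake exposes an ADLS Gen2-compatible endpoint, existing Spark code can read it directly. The sketch below shows the general shape of such a read; the workspace name (SalesWorkspace), lakehouse name (SalesLakehouse), and table name (orders) are placeholders, and in a Fabric notebook the Spark session and Delta support are already configured.

# Minimal sketch: reading a Delta table through OneLake's ADLS Gen2-compatible endpoint.
# "SalesWorkspace", "SalesLakehouse", and "orders" are placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# OneLake paths follow the tenant -> workspace -> item hierarchy:
# abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<item>.Lakehouse/Tables|Files/...
table_path = (
    "abfss://SalesWorkspace@onelake.dfs.fabric.microsoft.com/"
    "SalesLakehouse.Lakehouse/Tables/orders"
)

orders = spark.read.format("delta").load(table_path)
orders.show(5)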

Basic Terms Used in Microsoft Fabric

  1. Tenant: A tenant is a single instance of Microsoft Fabric for an organization.
  2. Workspace: Within a tenant, you can create any number of workspaces. Access policies for users are managed at the workspace level. Each workspace has its own capacity, which is billed separately.
  3. Data Items: Within a workspace, you can create any number of data items, which provide access to the data stored in OneLake. Data items include pipelines, notebooks, lakehouses, and warehouses.

Data Lakehouse vs Data Warehouse 

Data Lakehouse:
A Lakehouse is a component in Fabric where you can store, manage, and analyze structured, semi-structured, and unstructured data in a single location. The Lakehouse lets you perform data transformations using Spark compute or use the SQL analytics endpoint for data analysis and exploration. The Fabric Lakehouse uses Delta as its default file format, ensuring optimal performance for analytical workloads.
Data can be stored in two physical locations: Files and Tables.

Files: In a Lakehouse, this is the unmanaged area for storing data in any file format. Delta tables stored in this location are not registered as tables. To use such a Delta table, you need to create a shortcut or an external table that points to the location in the unmanaged folder where the Delta Lake files reside.

Tables: This is the managed area of the Lakehouse where data is stored as tables of various formats (CSV, Parquet, or Delta). Delta tables with a file-based transaction log are also recognized as tables. The automatic table discovery and registration process runs only over folders created in this managed area.
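
The split between the managed Tables area and the unmanaged Files area can be illustrated with a short Spark sketch. It assumes it runs in a Fabric notebook with a default Lakehouse attached; the table and folder names are illustrative.

# Sketch of the managed vs. unmanaged split in a Lakehouse (illustrative names,
# assumes a default Lakehouse is attached to the notebook).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 24.50)],
    ["id", "product", "price"],
)

# Managed area: saveAsTable writes Delta files under Tables/ and the table is
# discovered and registered automatically, so it is queryable by name.
df.write.format("delta").mode("overwrite").saveAsTable("products")

# Unmanaged area: the same data written under Files/ is just a folder of Delta
# files; it is not registered as a table until a shortcut or external reference
# points at it.
df.write.format("delta").mode("overwrite").save("Files/raw/products_delta")

spark.sql("SELECT * FROM products").show()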

Data Warehouse:
The Warehouse is a component in Fabric that stores data in OneLake and allows interaction through SQL queries. The Warehouse supports only structured data. Users can create tables in the Warehouse, load data, perform transformations, and query the data. The Warehouse is a SQL engine built on Delta tables and endpoints, providing T-SQL DDL and DML support for data manipulation and analysis.
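
As a rough illustration of that T-SQL DDL/DML surface, the sketch below connects to a Warehouse's SQL endpoint with pyodbc and creates, loads, and queries a table. The server address, database name, table, and authentication method are placeholders that depend on your environment.

# Sketch of Warehouse interaction over its SQL endpoint using pyodbc.
# Connection details below are placeholders, not a real endpoint.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-endpoint.datawarehouse.fabric.microsoft.com;"  # placeholder endpoint
    "DATABASE=SalesWarehouse;"                                  # placeholder warehouse name
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# DDL: create a table in the Warehouse.
cursor.execute("""
    CREATE TABLE dbo.Customers (
        CustomerId INT,
        Name       VARCHAR(100),
        Email      VARCHAR(100)
    );
""")

# DML: load and query data with plain T-SQL.
cursor.execute("INSERT INTO dbo.Customers VALUES (1, 'Alice', 'alice@example.com');")
cursor.execute("SELECT CustomerId, Name FROM dbo.Customers;")
print(cursor.fetchall())

conn.commit()
conn.close()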

When to use Lakehouse or Warehouse

When choosing between a Data Lakehouse and a Data Warehouse, it's important to note that both use the open Delta Parquet format. The decision should therefore be based on your specific data requirements.

  1. Type of Data: Both the Lakehouse and the Warehouse can handle unlimited data volumes. The Warehouse supports only structured data, while the Lakehouse supports unstructured, semi-structured, and structured data.
  2. Engines: The Lakehouse uses the Spark engine, while the Warehouse is built on the Polaris engine, an interactive relational query engine.
  3. Data Hierarchy: In the Lakehouse, data is organized into folders, files, databases, and tables. In the Warehouse, data is organized by databases, schemas, and tables.
  4. Write/Read Operations: In the Lakehouse, you use Spark to write data and handle tasks, while T-SQL is used for read-only operations through the SQL analytics endpoint. In the Warehouse, T-SQL is used for both writing and reading data.
  5. Shortcuts: The Lakehouse supports shortcuts, but the Warehouse does not. Shortcuts are objects in OneLake that point to storage locations, which can be either internal or external to OneLake.
  6. If you need to migrate a data warehouse from sources such as SQL Server, Azure SQL Database, or Synapse Analytics, a Warehouse may be the easiest option because it lets you reuse most of the existing code with minimal difficulty.
  7. If you need to support use cases beyond data warehousing, such as real-time data processing or machine learning, a Lakehouse is a better option than a Warehouse.
  8. Dynamic Data Masking: In the Lakehouse, you can use the SQL analytics endpoint to hide sensitive data from users, but the Spark engine cannot do this; it also does not support object-level or row-level security. In the Warehouse, dynamic data masking is fully supported, along with object-level, column-level, and row-level security and T-SQL DDL/DML operations. Dynamic data masking hides sensitive data from users who don't have permission to see it, and administrators decide how much of the data is shown to different users (see the masking sketch after this list).
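
Following on from point 8, here is a hedged sketch of what dynamic data masking could look like on a Warehouse table, using the same kind of pyodbc connection as the earlier Warehouse example; the connection details and table are placeholders.

# Sketch of dynamic data masking on a Warehouse table (placeholder connection details;
# reuses the illustrative dbo.Customers table from the previous sketch).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "DATABASE=SalesWarehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Mask the Email column so non-privileged users see only a masked value;
# users with sufficient permissions still see the full data.
cursor.execute(
    "ALTER TABLE dbo.Customers "
    "ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');"
)
conn.commit()
conn.close()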

Note: In a Lakehouse, the SQL analytics endpoint only allows reading data, so you can't delete tables through the endpoint. To delete tables, you need to use a notebook.
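
For example, a single notebook cell like the following (using the illustrative products table from the earlier sketch) is enough to drop a Lakehouse table, since the SQL analytics endpoint is read-only.

# Dropping a Lakehouse table from a notebook; "products" is the illustrative
# table created in the earlier sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("DROP TABLE IF EXISTS products")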

Combination of both (Medallion architecture) 

The Medallion architecture consists of three distinct data layers, or zones, each representing a different level of data quality within the Lakehouse; higher levels represent higher data quality. The three layers are Bronze (raw), Silver (validated), and Gold (enriched/curated), each of which is explained below. The architecture ensures that data follows the ACID properties (Atomicity, Consistency, Isolation, and Durability) across all layers, and the goal is to improve the structure and quality of data at each stage.

Bronze Layer: This is the raw zone, the first layer, which stores data in its original format. Data in the bronze layer is append-only. User interaction in this zone is minimal, as it represents the source data.

Silver Layer: This is the validated zone, which stores data sourced from the bronze layer. The raw data undergoes cleansing and standardization, and data in this zone is structured as tables (rows and columns). In this zone, data quality and consistency are prioritized over data modelling.

Gold Layer: This is the enriched zone, the final layer of the architecture, sourced from the silver layer. Data in the gold layer is "business ready": it is prepared for analytical and business needs and is used directly in decision-making processes.
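
A minimal end-to-end sketch of the three layers in a Fabric notebook might look like the following; the file path, column names, and table names are illustrative.

# Sketch of a bronze -> silver -> gold flow with Spark and Delta tables.
# The source file, columns, and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw source file as-is (append-only).
bronze = spark.read.option("header", True).csv("Files/bronze/orders.csv")
bronze.write.format("delta").mode("append").saveAsTable("bronze_orders")

# Silver: cleanse and standardize -- drop bad rows, fix types, deduplicate.
silver = (
    spark.table("bronze_orders")
    .dropna(subset=["order_id", "amount"])
    .withColumn("amount", F.col("amount").cast("double"))
    .dropDuplicates(["order_id"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: aggregate into a business-ready table for reporting.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_customer_spend")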


Chandana R L
