Introduction on OneLake
OneLake is a single data lake for the entire organization on Microsoft Fabric. It is a SaaS service that comes automatically with Microsoft Fabric tenant, so there is no need to create or purchase OneLake separately. Each fabric tenant has a single OneLake instance. Data lake contains large amount of data from various sources.
OneLake is a logical storage solution. The physical storage backend is ADLS Gen2(Azure Data Lake Storage Gen2). The data stored in OneLake is in single format: Delta. OneLake supports both structured and unstructured file formats. A single organization can have OneLake to store all its data, which can then be analyzed using various analytics tools. Before OneLake, organizations faced the challenge of managing multiple data storage resources. With OneLake, they need to manage only one resource for all their data.
Basic Terms Used in Microsoft Fabric
Data Lakehouse vs Data Warehouse
Data Lakehouse:
A Lakehouse is a component in fabric where you can store, manage, and analyze structured, semi-structured, and unstructured data in a single location. The Lakehouse enables you to perform data transformations using spark compute or utilize SQL endpoints for data analysis and exploration. Fabric Lakehouse uses delta as its default file format, ensuring optimal performance for analytical workloads.
Data can be stored in 2 physical locations: Files and Tables
Files: In a Lakehouse environment, this refers to an unmanaged area for storing data in any file format. If delta tables are stored in this location, they are not considered as tables. To use the delta table, you need to create a shortcut or an external table that points to the location in the unmanaged folder where delta lake files are available.
Tables: This is the managed area in the Lakehouse where data is stored in tables of various formats (CSV, Parquet or Delta). Any delta tables with a file-based transaction log are also considered as tables. The automatic table discovery and registration process will run only over folders created in the managed area.
Data Warehouse:
Warehouse is a component in fabric that stores data in OneLake, allowing interaction through SQL queries. The Warehouse supports only structured data. Users can create tables in warehouse, load data, perform transformation and query the data. The Data Warehouse is a SQL Engine equipped with Delta Tables and endpoints, providing T-SQL DDL and DML support for data manipulation and analysis.
When to use Lakehouse or Warehouse
When choosing between a Data Lakehouse and Data Warehouse, it’s essential to find that both utilize the open data format delta parquet. However, the decision should be based on your specific data requirements.
Note: In a Lakehouse with a SQL analytics endpoint that only allows reading data, you can’t delete tables using the endpoint. To delete tables, you’ll need to use a notebook.
Combination of both (Medallion architecture)
The Medallion architecture consists of three distinct data layers or zones, each representing a different level of data quality within the Lakehouse. Higher levels represent higher data quality. The three layers are Bronze (raw), Silver (validated), and Gold (enriched/curated) which is explained further for better understanding. This architecture ensures that data follows the ACID properties (Atomicity, Consistency, Isolation, and Durability) across all layers. The goal is to improve the structure and quality of data at each stage.
Bronze Layer: This is the raw zone, the first layer that stores data in its original format. The data in bronze layer is append only. User interaction in this zone is minimal, as it represents the source data.
Silver Layer: This is the validated zone, which stores data from the bronze layer. The raw data from the bronze layer undergoes cleansing and standardization processes. Data in this zone is structured as tables (rows and columns). In this zone data quality and consistency are prioritized over data modelling.
Gold Layer: This is the enriched zone, the final layer of the architecture, sourced from the silver layer. The data in gold layer is “business ready” data, it is ready for analytical and business needs and is directly used in decision making processes.
Chandana R L