Best Practices to follow for a Delta Table Creation

This blog provides detailed documentation of the best practices and optimizations to follow for Delta tables in order to improve read performance.

Step 1: Physical Partitioning 

  1. Partition Definition: 
  • A partition in the context of Delta tables refers to a specific subset of rows within a table. These rows share a common value for one or more predefined columns known as the “partitioning columns.” 
  • For instance, you might have a table of sales data, and you choose to partition it by the “year” and “month” columns. In this case, each partition represents sales data for a specific year and month. 
  2. Query and Data Manipulation Speed: 
  • The use of partitions can significantly improve the performance of both querying and manipulating data within a Delta table. 
  • When you perform queries that involve filtering, sorting, or aggregating based on the partitioning columns, the database engine can take advantage of these partitions to narrow down the search space efficiently. This means that fewer rows need to be processed, resulting in faster query execution. 
  • Similarly, operations like data inserts, updates, or deletes that target specific partitions are more efficient because they only affect a subset of the data rather than the entire table. 
  3. Skipping Unnecessary Data: 
  • One of the key advantages of proper partitioning is the ability to skip reading unnecessary data when filtering on columns with low cardinality. 
  • For example, if your Delta table is partitioned by the “year” column, and you’re interested in retrieving data for the year 2022, the database engine can skip reading partitions corresponding to other years (e.g., 2021 or 2023). This skip operation dramatically reduces I/O and processing overhead. 
  • This optimization becomes particularly valuable when dealing with large datasets, where reading unnecessary data can be resource-intensive and time-consuming. 
  4. Defining Partitions: 
  • To utilize partitions effectively, you define them when you create a Delta table. This is achieved by specifying the set of columns that will serve as partitioning columns using the PARTITIONED BY clause. 
  • In the SQL or DDL (Data Definition Language) statement that creates the table, you explicitly indicate which columns should be used for partitioning, as shown in the example below. 
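A minimal sketch of such a definition follows; the table and column names are illustrative and not taken from the original post:

-- Hypothetical example: a sales table partitioned by year and month.
CREATE TABLE sales_delta (
  sale_id BIGINT,
  amount DOUBLE,
  year INT,
  month INT
)
USING DELTA
PARTITIONED BY (year, month);

With this layout, a query filtered with WHERE year = 2022 AND month = 6 only reads the files belonging to that single partition.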

Step 2: Optimization 

  1. Optimizing Existing Delta Tables: When a Delta table already exists, it’s advisable to perform optimization on the entire table. This process compacts all existing files, albeit taking more time. 
  2. The result is the consolidation of all smaller files into larger files, typically 1 GB in size. This consolidation offers several benefits: 
  • Enhanced read performance. 
  • Reduced metadata overhead, which can be beneficial for overall efficiency. 
  3. There are two optimization options to consider: 

Option 1: If you are uncertain about which columns are frequently used in queries, you can use a straightforward optimization command: 

OPTIMIZE delta_table_name  

Option 2: If you have a clear understanding of which columns are frequently used in queries, you can employ a more advanced optimization method. Using ZORDER BY, you can co-locate data to achieve even better performance: 

OPTIMIZE delta_table_name ZORDER BY col1, col2, col3  

In this case, it’s important to list the most frequently used columns first. 

Regular Optimization: 

  1. Following the initial one-time optimization activity mentioned above, it is advisable to continue optimizing the Delta table on a daily or periodic basis, such as every few hours (see the sketch below). 
  2. Regular optimization helps maintain the table’s performance by consolidating files and managing metadata efficiently. 
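A hedged sketch of such a periodic job is shown here. It assumes the table is partitioned by year and month and that col1 and col2 are the most frequently filtered columns (all names are illustrative); restricting the OPTIMIZE to the partition that received new data keeps the run short:

-- Hypothetical periodic job: compact and Z-order only the partition that received new data.
OPTIMIZE delta_table_name
WHERE year = 2023 AND month = 11
ZORDER BY (col1, col2);

Note that the WHERE clause of OPTIMIZE can only filter on partitioning columns.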
In summary, optimizing Delta tables involves initial optimization to consolidate files and reduce metadata overhead. There are two options: one for cases where you are unsure about frequently used columns and another for cases where you have a clear understanding of column usage. After the initial optimization, it is recommended to continue optimizing the table regularly to ensure ongoing performance benefits. 

Step 3: Auto Optimization 

Enabling Write Optimization: 

  1. To enhance the performance of your Delta table, it’s recommended to enable write optimization by setting the appropriate table property. 
  2. When data is loaded into the table, the optimized writes feature automatically rebalances the data being written so that it lands in fewer, larger files. 
  3. This process optimizes the table for better query performance and resource efficiency. 
  4. You can activate this feature using the following SQL command: 

ALTER TABLE delta_table SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)  
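The same property can also be set when a table is first created, so that optimized writes apply from the very first load. A minimal sketch, using an illustrative table name:

-- Hypothetical example: enable optimized writes at table creation time.
CREATE TABLE sales_delta (sale_id BIGINT, amount DOUBLE)
USING DELTA
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true);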

Consideration for Low Latency: 

  1. It’s important to note that enabling auto compaction may not be necessary in all scenarios. 
  2. If your data loading process prioritizes low latency, you may choose not to enable auto compaction. This decision depends on your specific use case and performance requirements. 

In summary, Step 3 involves enabling auto optimization for write operations on your Delta table. This feature automatically consolidates partitions and organizes data into larger files, leading to improved query performance and resource efficiency. However, the choice to enable auto compaction should consider your latency requirements and use case. 

Step 4: Auto Compaction (Optional) 

When to Consider Auto Compaction: 

  1. Auto compaction is an optional feature that you may consider using in specific scenarios. 
  2. It becomes relevant when there is a need to optimize the Delta table after each write operation, with the primary goal of enhancing read performance. 
  3. However, it’s important to note that enabling auto compaction might result in additional time being spent on data loading into Delta. 
  4. The decision to use auto compaction should be based on business requirements: if data-loading speed is the priority, you may choose not to activate this option. 
  5. You can enable auto compaction using the following SQL command: 

ALTER TABLE delta_table SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)  
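If both optimized writes and auto compaction are wanted, the two properties can be set in a single statement. A minimal sketch, reusing the illustrative table name from the commands above:

-- Hypothetical example: enable optimized writes and auto compaction together.
ALTER TABLE delta_table SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);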

Regular OPTIMIZE and ZORDER BY Recommendations: 

  1. Even if you enable write optimization and auto compaction, it’s advisable to schedule regular OPTIMIZE operations to further consolidate files and reduce Delta table metadata. 
  2. It’s worth noting that Auto Optimize does not support Z-Ordering. Therefore, it’s essential to continue scheduling OPTIMIZE…ZORDER BY jobs periodically to maintain optimal performance. 

File Size Considerations: 

  1. Optimized writes and auto compaction generally result in the generation of smaller files, typically around 128 MB in size. 
  2. In contrast, standard OPTIMIZE operations create larger files, usually around 1 GB in size (a file-size tuning sketch follows this list). 
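If neither default is a good fit, Databricks also exposes a delta.targetFileSize table property for tuning the file size that optimization operations aim for. A hedged sketch, to be verified against the documentation for your runtime:

-- Hypothetical example: target roughly 256 MB files instead of the defaults.
ALTER TABLE delta_table SET TBLPROPERTIES (delta.targetFileSize = '256mb');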

In summary, Step 4 introduces the optional feature of auto compaction for Delta tables, which optimizes the table after each write operation to enhance read performance. However, it may increase data loading time, so its usage should align with specific business requirements. Even with auto compaction enabled, it’s recommended to continue scheduling regular OPTIMIZE operations to consolidate files and manage metadata. Additionally, it’s important to remember that Z-Ordering should still be performed separately for optimal performance. 

 

Step 5: Delta Stats at File Level 

  1. Default Column Statistics: 
  • By default, when using Delta Lake on Databricks, statistics are collected for the first 32 columns defined in your table schema. 
  • You have the flexibility to modify this behavior by adjusting the delta.dataSkippingNumIndexedCols table property. 
  • However, it’s essential to be mindful that collecting statistics for additional columns incurs extra overhead during file writes. 
  2. Managing Statistics for Long Strings: 
  • Collecting statistics on long string columns can be resource-intensive. 

To mitigate this, you can take one of two approaches: 

  • Configure the delta.dataSkippingNumIndexedCols table property to exclude long string columns from statistics collection (a sketch of this appears after the reordering examples below). 
  • Alternatively, you can relocate columns containing long strings to a position in the schema beyond the specified delta.dataSkippingNumIndexedCols. This can be accomplished using the ALTER TABLE ALTER COLUMN statement. 
  3. Sample ALTER Commands for Reordering Columns: 
  • To illustrate the process of moving key fields within your schema, you can use ALTER TABLE statements. Here are some sample commands for reordering columns: 

Example 1: Moving the “id” column to be the first column in the table: 

ALTER TABLE databricks_agg.oms_order_line ALTER COLUMN id FIRST;  

Example 2: Placing the “order_created_on” column after the “id” column: 

ALTER TABLE databricks_agg.oms_order_line ALTER COLUMN order_created_on AFTER id;  
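For the first approach described above, the number of indexed columns itself can be reduced so that long string columns positioned later in the schema fall outside statistics collection. A minimal sketch, assuming the illustrative choice of indexing only the first 5 columns:

-- Hypothetical example: collect file-level statistics only for the first 5 columns.
ALTER TABLE databricks_agg.oms_order_line SET TBLPROPERTIES (delta.dataSkippingNumIndexedCols = 5);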

In summary, Step 5 deals with collecting statistics at the file level in Delta Lake. You can control which columns are included in statistics by adjusting the delta.dataSkippingNumIndexedCols table property. Additionally, managing statistics for long string columns is essential for efficient operation, and you can achieve this by either excluding them from statistics or relocating them within the schema using ALTER TABLE commands. 

 

Step 6: Z-Ordering 

  1. Introduction to Z-Ordering: 
  • Z-ordering is a technique used to co-locate related information within the same set of files. This co-locality is automatically used by the data-skipping algorithms of Delta Lake on Databricks. 
  • The primary objective of Z-ordering is to significantly reduce the amount of data that Delta Lake on Databricks needs to read during query operations. 
  • To implement Z-ordering, you specify the columns on which to apply this technique using the ZORDER BY clause within an OPTIMIZE command. 
  2. When to Use Z-Ordering: 
  • Z-ordering is particularly beneficial when you anticipate that a specific column will be frequently used in query predicates. 
  • This technique is most effective when applied to columns with high cardinality, meaning they have a large number of distinct values. 
  3. Multiple Columns in Z-Ordering: 
  • While you can specify multiple columns for Z-Ordering as a comma-separated list, it’s important to note that the effectiveness of co-locality decreases with each additional column included. 
  4. Importance of Column Statistics: 
  • Z-ordering relies on column-level statistics such as minimum, maximum, and count to function effectively. 
  • Attempting to use Z-ordering on columns that do not have collected statistics would be ineffective and could consume unnecessary resources. 
  5. Configuring Statistics Collection: 

To optimize the effectiveness of Z-ordering, you have two options for configuring statistics collection: 

  • Reordering columns within the schema to affect which columns statistics are collected for. 
  • Increasing the number of columns for which statistics are collected. 
  6. Incremental Nature of Z-Ordering: 
  • Z-ordering is not an idempotent operation, but it aims to be an incremental one. 
  • The time required for Z-ordering is not guaranteed to decrease with multiple runs, but if no new data has been added to a partition that was recently Z-ordered, performing Z-ordering on that partition again will have no impact. 

In summary, Step 6 discusses the concept of Z-ordering, which improves query performance by co-locating related data in the same files. You should consider using Z-ordering for columns with high cardinality that are frequently used in query predicates. It’s essential to have statistics collected on the relevant columns, and you can adjust statistics collection by reordering columns or increasing the number of columns with statistics. Additionally, Z-ordering is not an idempotent operation and should be executed with consideration of its potential impact on query performance. 

 

Step 7: Vacuum 

  1. Cleaning Unnecessary Data: 
  • The VACUUM operation is responsible for removing files from a Delta table directory that are not managed by Delta, along with data files that are no longer in the latest state of the transaction log for the table and exceed a specified retention threshold. 
  2. Scanning Valid Partitions: 
  • During the VACUUM process, it scans all valid partitions within the target Delta table to identify and clean unnecessary files. 
  3. Retention Based on Logical Removal: 
  • It’s important to note that Delta table data files are deleted based on the time they have been logically removed from Delta’s transaction log, plus any retention hours specified, rather than their modification timestamps on the storage system. 
  4. Default Retention Period: 
  • By default, the retention threshold is set to 7 days, meaning files that are older than this period, considering the logical removal time, will be removed during a VACUUM operation. 
  5. Manual VACUUM Operation: 
  • It’s worth mentioning that on Delta tables, Azure Databricks does not trigger VACUUM operations automatically. They need to be initiated manually as needed. 
  6. Impact on Time Travel: 
  • Running VACUUM on a Delta table has a consequence: it prevents you from time traveling back to a version of the table that is older than the specified data retention period. 
  7. VACUUM Syntax: 
  • The syntax for initiating a VACUUM operation is as follows: 

VACUUM table_name [RETAIN num HOURS] [DRY RUN]  
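As a usage sketch (the table name is illustrative), a DRY RUN can be issued first to preview which files would be deleted, followed by the actual cleanup with the default 7-day (168-hour) retention:

-- Hypothetical example: preview the files that would be removed, then run the actual vacuum.
VACUUM delta_table_name RETAIN 168 HOURS DRY RUN;
VACUUM delta_table_name RETAIN 168 HOURS;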

In summary, Step 7 introduces the VACUUM operation, which is used to clean unnecessary files and manage data retention in a Delta table. VACUUM scans valid partitions, removes files based on logical removal time and retention settings, and should be run manually as needed. It’s essential to understand the impact of VACUUM on time travel capabilities, as well as the syntax for executing this operation. 

