ADW TO DELTA MIGRATION

Myntra’s decision to transition from Azure Data Warehouse (ADW) to a Lakehouse architecture was driven by several key factors that highlighted the limitations of their previous data warehousing approach. To gain a deeper understanding of why this shift was necessary, it’s essential to examine how extensively data was being utilized: 

  1. Multiple Data Sources: Myntra was ingesting data from various sources, including MySQL, Google Sheets, Hive tables, and storage accounts.
  2. Diverse File Formats: Data was stored in storage accounts in a variety of formats such as Parquet, ORC, and CSV. This data was regularly loaded into ADW, or external tables were created to access it.
  3. Big Data: As an e-commerce company, Myntra had to capture and store a vast amount of data to operate, maintain, and enhance its product. Even small interactions, such as clicking on a product icon, generated data, resulting in continually expanding data volumes.
  4. Bronze, Silver, and Gold Tables: Data was categorized into three primary types: bronze (raw, unvalidated data), silver (validated and enriched data), and gold (refined and aggregated data for analytics and machine learning).
  5. Multiple End Users and Applications: Numerous teams and users had access to ADW, resulting in concurrent queries and multiple jobs to create aggregates, support data visualization, and extract data for dashboard creation.
  6. Multiple Platforms for Data Visualization: A significant amount of data stored in ADW was also replicated in Hive, as many users and teams used Trino as a query engine. Access to ADW was restricted for some, necessitating the use of alternative platforms.
  7. Table Duplication: Some tables were duplicated across multiple databases to facilitate various aggregations, as ADW treated databases as separate instances.

These characteristics of the data warehousing setup led to several operational challenges for the organization: 

  1. Data Duplication: Data from multiple sources was duplicated across different platforms, increasing storage costs and data management complexity.
  2. Query Locks: Ingesting data from multiple sources into a single cluster caused query locks, hindering concurrent operations and potentially impacting data access.
  3. Cost Escalation: Growing data volumes resulted in increased warehousing and compute costs.
  4. Maintenance Issues: Poorly executed maintenance activities led to delays in query execution, job processing, data ingestion, and extraction.
  5. Cluster Performance: The increase in data size impacted cluster performance, leading to delays in critical table population and resource contention.
  6. Resource Dependency: The organization’s workloads relied heavily on a single cluster, causing resource bottlenecks and system failures.
  7. Cluster Stability: The single-cluster approach occasionally led to cluster failures during data operations, ingestion, and aggregation.
  8. Schema Complexity: Multiple tables with similar data were duplicated across different databases, further complicating the schema.
  9. External Dependencies: Data present in storage accounts in various formats introduced external dependencies, potentially causing query slowdowns.
  10. Maintenance Time: As data grew, the time required for warehouse optimization and maintenance activities also increased, causing delays in other scheduled jobs.
  11. Query Planning: Building optimal query plans became time-consuming with increased data, aggregations, and users.

To address these performance and scalability issues, the organization adopted a new approach: 

  1. Decoupling Storage and Compute: Recognizing that a significant portion of ADW’s compute was used for stored procedures to generate silver and gold tables, they began moving some processing tasks to Spark, utilizing data from storage accounts and aggregating it on Spark clusters before pushing it to ADW.
  2. Delta Tables Implementation: To eliminate data duplication and reduce unnecessary compute costs, they introduced Delta tables, an open-source storage layer that provided ACID transactions, scalable metadata handling, and compatibility with Apache Spark APIs (a simplified sketch of this pattern follows the list).
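
The pattern described in the two points above can be sketched in a few lines of PySpark. This is a minimal illustration only, assuming a hypothetical orders dataset, column names, and storage-account paths (none of these names come from Myntra's actual setup):

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("adw-offload-aggregation")
    # Enable Delta Lake on a plain Spark cluster (Databricks ships with this pre-configured).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read raw (bronze) order events straight from the storage account instead of loading them into ADW.
raw_orders = spark.read.parquet("abfss://bronze@examplestorage.dfs.core.windows.net/orders/")

# Run the aggregation that previously lived in an ADW stored procedure.
daily_revenue = (
    raw_orders
    .groupBy("order_date", "category")
    .agg(F.sum("order_amount").alias("total_revenue"),
         F.countDistinct("order_id").alias("order_count"))
)

# Persist the result as a Delta table so any Spark (or Trino) cluster can query it,
# instead of pushing everything back into a single warehouse instance.
(daily_revenue.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("abfss://gold@examplestorage.dfs.core.windows.net/daily_revenue/"))
```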

Delta tables addressed many of the organization’s challenges, including data source diversity, duplication, decoupling of storage and compute, multi-cluster workloads, and unified access for Hive and Delta tables. This transition allowed Myntra to optimize their data architecture for improved performance, cost efficiency, and scalability. 

The adoption of Delta tables offered several notable benefits to Myntra’s data management: 

  1. ACID Transactions: Delta tables provided support for ACID transactions, ensuring data consistency during concurrent operations.
  2. Scalable Metadata Handling: Myntra could efficiently manage metadata for large-scale tables with billions of files thanks to Delta Lake’s utilization of Spark’s distributed processing capabilities.
  3. Unified Streaming and Batch Processing: Delta Lake seamlessly integrated batch and streaming data processing, simplifying data ingestion, historical backfill, and interactive queries.
  4. Schema Enforcement: Delta tables automatically handled schema variations, ensuring that only valid records were inserted during data ingestion.
  5. Time Travel: The ability to access historical data versions allowed for rollbacks, comprehensive audit trails, and reproducible machine learning experiments.
  6. Upserts and Deletes: Support for merge, update, and delete operations enabled complex use cases such as change data capture (CDC), slowly changing dimension (SCD) operations, and streaming upserts (a minimal merge sketch follows this list).
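
As a purely illustrative example of the upsert capability, the sketch below merges a batch of changed records into a Delta table using the Delta Lake Python API; the table paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-upsert-sketch").getOrCreate()

# Target Delta table and a batch of new/changed customer records (hypothetical paths).
customers = DeltaTable.forPath(
    spark, "abfss://gold@examplestorage.dfs.core.windows.net/customers/")
updates = spark.read.parquet(
    "abfss://bronze@examplestorage.dfs.core.windows.net/customer_updates/")

# Upsert: update matching rows, insert new ones - the pattern behind CDC and SCD-style loads.
(customers.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```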

The choice of Delta tables was influenced by several key considerations: 

  1. Transactional Storage Format: Delta format is built on top of Apache Parquet, offering transactional features like ACID transactions, schema enforcement, and data versioning.
  2. Concurrency and Versioning: Delta format was ideal for use cases requiring high concurrency, frequent updates, and the ability to query data at different points in time (see the time-travel sketch after this list).
  3. Data Lake Compatibility: Delta format suited data lake environments where data was ingested from multiple sources and processed using various ETL and analytics tools.
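
The ability to query data at different points in time is exposed through Delta's time-travel read options. A minimal sketch, with an illustrative table path, version number, and timestamp:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel-sketch").getOrCreate()

path = "abfss://gold@examplestorage.dfs.core.windows.net/daily_revenue/"

# Read the table as it existed at a specific commit version...
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load(path)

# ...or as of a point in time, e.g. to reproduce an older report or ML training set.
df_jan = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load(path)
```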

How Delta Works: 

Delta Lake acts as an intermediary layer between compute engines such as Spark and Trino and the cloud storage in an existing data lake setup. The underlying data is stored in cloud storage as Parquet files. For every committed transaction, Delta Lake writes an entry to the transaction log as a JSON file; these entries capture the operation performed, the resulting table snapshot, and statistics about the data. The sequentially numbered JSON files collectively form an ordered log of all changes to the table. 
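
For intuition, the transaction log can be inspected directly. The sketch below (hypothetical table path) reads one of the sequentially numbered commit files under _delta_log and then shows the table history that Delta derives from that log:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-log-inspection").getOrCreate()

table_path = "abfss://gold@examplestorage.dfs.core.windows.net/daily_revenue/"

# Each commit is a sequentially numbered JSON file under _delta_log/, e.g.
#   _delta_log/00000000000000000000.json
#   _delta_log/00000000000000000001.json
# holding the commit info plus add/remove file actions and their statistics.
first_commit = spark.read.json(table_path + "_delta_log/00000000000000000000.json")
first_commit.show(truncate=False)

# Higher-level view of the same log: one row per committed transaction.
(DeltaTable.forPath(spark, table_path)
    .history()
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(truncate=False))
```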

By adopting Delta tables and transitioning to a Lakehouse architecture, Myntra was able to address the challenges posed by their previous data warehousing setup. They achieved better data management, improved performance, reduced data duplication, and enhanced scalability while leveraging the power of Delta Lake on Azure Databricks. 

This strategic move allowed Myntra to streamline their data architecture, making it more efficient, cost-effective, and adaptable to their evolving data needs, ultimately empowering the organization to make data-driven decisions and achieve business objectives more effectively. 


sakhib.rahil
