Exploring Change Data Capture in Azure Data Factory

Blogs

SAP CDC Connector in Azure Data Factory
September 20, 2023
Investigating Identity Column Jumps in SQL Server 2019
November 5, 2023

Exploring Change Data Capture in Azure Data Factory

Exploring Change Data Capture in Azure Data Factory

In the world of data integration and analytics, staying up to date with changes in your data is crucial. Change Data Capture (CDC) is a technology that enables you to identify and capture changes made to your data sources in real-time or near real-time. This capability is invaluable for maintaining data accuracy, improving data warehousing processes, and supporting various analytical tasks. In this blog post, we’ll delve into Change Data Capture in the context of Azure Data Factory, Microsoft’s cloud-based data integration service.

What is Change Data Capture (CDC)?

Change Data Capture is a method used to identify and track changes made to data in a source system, such as a database, so that these changes can be propagated to a target system, like a data warehouse or reporting database. CDC helps ensure that your target system stays synchronized with the source data, making it a fundamental component of data integration and data warehousing strategies.

 

Why Use Change Data Capture?

CDC offers several benefits, including:

  1. Real-time Data Updates: CDC allows you to keep your data warehouse or data lake up to date in near-real-time, ensuring that your analytics and reporting reflect the latest changes in your source systems.
  2. Reduced ETL Processing: Traditional Extract, Transform, Load (ETL) processes can be resource-intensive and time-consuming. CDC reduces the need for full data reloads and simplifies data synchronization, saving time and resources.
  3. Improved Data Quality: By capturing changes as they occur, CDC helps maintain data accuracy and consistency, reducing the risk of data errors and inconsistencies.
  4. Faster Decision-Making: Access to the most recent data enables faster and more informed decision-making, which is crucial in today’s competitive business environment.

Setting up Change Data Capture:

  1. To enable Change Data Capture (CDC) in Azure Data Factory (ADF), you primarily need to enable CDC at the database level of your source system, and in some cases, at the table level, depending on your source database technology.

Here’s a general overview of the process:

a. Enable CDC at the Database Level: In most relational database systems like SQL Server or Azure SQL Database, CDC is typically enabled at the database level. You would need appropriate database permissions to enable CDC. Enabling CDC at the database level sets the stage for capturing changes in all the tables within that database.

To enable CDC is enabled on database:

Exec Sys.Sp_cdc_enable_db

b. Enable CDC at the Table Level: After enabling CDC at the database level, you can choose specific tables within that database for which you want to capture changes. If you enable CDC at the table level, you can have granular control over which tables and columns you want to track changes for.

To check CDC enabled on Table:

Select name, is _cdc_enabled from sys.databases

To enable CDC is enabled on Table:

Exec Sys.Sp_cdc_enable_table

@source_schema = ‘[SchemaName]’,

@source_name = ‘[TableName]’,

@role_name = NULL,

@supports_net_changes = 1

2. Configure CDC in ADF: Once CDC is enabled at the source database, you can configure your ADF pipeline to leverage CDC. This involves setting up the source and destination datasets in ADF, specifying the CDC configuration (such as watermark columns and tracking changes), and creating data integration pipelines to synchronize the changed data.

Recently, a newly introduced directory has emerged within Factory Resources, and it goes by the name of “Change Data Capture (preview)”.

Note: The Change Data Capture is in preview state.

Within that folder, click on the three dots and proceed to select “New Mapping (preview)”. On the initial screen, you’ll be tasked with defining the source.

Here, you have the choice to either select an existing linked service or create a new one. Your available options encompass CosmosDB, SQL Server, or a file source. After

making your selection, indicate the table(s) you wish to synchronize and then proceed by clicking “Continue”.

There are many Source type option enabled in CDC such as:

    • Azure CosmosDB for NoSQL
    • Azure SQL Database
    • Azure SQL Database Managed Instance
    • SQL Server
    • Snowflake
    • Avro
    • DelimitedText
    • JSON
    • ORC
    • Parquet
    • XML

On the target screen, make your selection for the target type and a linked service. In my scenario, I will be writing data to Parquet files, which means I’ll have to designate a container within Azure Data Lake where these files will be stored.

There are many Source type option enabled in CDC such as:

    • Azure SQL Database
    • Azure SQL Database Managed Instance
    • Azure Synapse Analytics
    • Avro
    • DelimitedText
    • JSON
    • ORC
    • Parquet
    • Delta

 

A fresh mapping entity has been generated. You have the option to modify the field mappings and include additional tables. Importantly, upon its creation, the mapping job remains inactive. To initiate the mapping operation, you must initially publish it and subsequently click on the “Start” button.

 

To monitor the CDC mapping, we can go to the monitor part of ADF, and click on change data capture, and then on our newly created mapping.

Checking the Azure data lake folder, the data has been written to the target folder.

Each run creates a new file.

Benefits of using Change Data Capture in ADF

Using the new CDC capabilities in ADF offers several benefits, including:

  1. Improved efficiency: CDC allows you to process only the changes in your data, reducing the amount of processing required and improving the overall efficiency of your data pipelines.
  2. Simplified setup: The new CDC resource simplifies the process of configuring and managing CDC workflows, making it easier for users to implement and maintain their data pipelines.
  3. Cost-effective: As you are billed only for the CDC process and not for the pipelines, using the new CDC feature can be a more economical solution for processing your data.

 

 

 

 

 

 

 

 

 

 


anamika.sar

Leave a Reply

Your email address will not be published. Required fields are marked *