In the world of data integration and analytics, staying up to date with changes in your data is crucial. Change Data Capture (CDC) is a technology that enables you to identify and capture changes made to your data sources in real-time or near real-time. This capability is invaluable for maintaining data accuracy, improving data warehousing processes, and supporting various analytical tasks. In this blog post, we’ll delve into Change Data Capture in the context of Azure Data Factory, Microsoft’s cloud-based data integration service.
What is Change Data Capture (CDC)?
Change Data Capture is a method used to identify and track changes made to data in a source system, such as a database, so that these changes can be propagated to a target system, like a data warehouse or reporting database. CDC helps ensure that your target system stays synchronized with the source data, making it a fundamental component of data integration and data warehousing strategies.
Why Use Change Data Capture?
CDC offers several benefits, including:
Setting up Change Data Capture:
Here’s a general overview of the process:
a. Enable CDC at the Database Level: In most relational database systems like SQL Server or Azure SQL Database, CDC is typically enabled at the database level. You would need appropriate database permissions to enable CDC. Enabling CDC at the database level sets the stage for capturing changes in all the tables within that database.
To enable CDC is enabled on database:
Exec Sys.Sp_cdc_enable_db
b. Enable CDC at the Table Level: After enabling CDC at the database level, you can choose specific tables within that database for which you want to capture changes. If you enable CDC at the table level, you can have granular control over which tables and columns you want to track changes for.
To check CDC enabled on Table:
Select name, is _cdc_enabled from sys.databases
To enable CDC is enabled on Table:
Exec Sys.Sp_cdc_enable_table
@source_schema = ‘[SchemaName]’,
@source_name = ‘[TableName]’,
@role_name = NULL,
@supports_net_changes = 1
2. Configure CDC in ADF: Once CDC is enabled at the source database, you can configure your ADF pipeline to leverage CDC. This involves setting up the source and destination datasets in ADF, specifying the CDC configuration (such as watermark columns and tracking changes), and creating data integration pipelines to synchronize the changed data.
Recently, a newly introduced directory has emerged within Factory Resources, and it goes by the name of “Change Data Capture (preview)”.
Note: The Change Data Capture is in preview state.
Within that folder, click on the three dots and proceed to select “New Mapping (preview)”. On the initial screen, you’ll be tasked with defining the source.
Here, you have the choice to either select an existing linked service or create a new one. Your available options encompass CosmosDB, SQL Server, or a file source. After
making your selection, indicate the table(s) you wish to synchronize and then proceed by clicking “Continue”.
There are many Source type option enabled in CDC such as:
On the target screen, make your selection for the target type and a linked service. In my scenario, I will be writing data to Parquet files, which means I’ll have to designate a container within Azure Data Lake where these files will be stored.
There are many Source type option enabled in CDC such as:
A fresh mapping entity has been generated. You have the option to modify the field mappings and include additional tables. Importantly, upon its creation, the mapping job remains inactive. To initiate the mapping operation, you must initially publish it and subsequently click on the “Start” button.
To monitor the CDC mapping, we can go to the monitor part of ADF, and click on change data capture, and then on our newly created mapping.
Checking the Azure data lake folder, the data has been written to the target folder.
Each run creates a new file.
Benefits of using Change Data Capture in ADF
Using the new CDC capabilities in ADF offers several benefits, including:
anamika.sar