Data Validation Using ADF for Data Migration


 

– MANAN CHOUDHARY

 

Azure Data Factory (ADF) is one of the most widely used tools in the industry for performing ETL operations.

When migrating data from a source to a destination, one of the most important requirements is a reliable check that the data has been transferred correctly.

In this blog, I will share how to perform data validation using two key metrics:

  • Row Count
  • Checksum

 

One of the projects I worked on for a client required migrating data from MySQL 5.6 to MySQL 5.7, and I was handling the task of validating the data.

I used ADF to build a pipeline that checks the row count and checksum for every table migrated from the source to the destination.

ADF is an ETL tool used to perform the transformations, joins, and other operations required by the business logic. The client wanted to make sure data integrity was maintained, and during data migration it is vital to ensure the data is consistent in both the source and the destination.

This blog contains the step-by-step process to create an ADF pipeline to validate the data.

Now, let’s start by building the pipeline that gets the row count for both the source and the destination.

Fig 1: Entire Pipeline for Row Count

We can start by dragging a Lookup activity onto the ADF canvas and configuring it with a SQL query that returns the list of table names and database names. We can achieve this using the metadata tables available in your SQL server.

Note: Do not provide a table name in your dataset unless you have explicitly created a metadata table yourself.
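As a sketch, assuming the source is MySQL and using its INFORMATION_SCHEMA metadata, the Lookup query can look like the following (the aliases database_name and table_name are assumptions that the later steps refer back to):

  -- List all user tables, excluding MySQL's system schemas
  SELECT TABLE_SCHEMA AS database_name,
         TABLE_NAME   AS table_name
  FROM   INFORMATION_SCHEMA.TABLES
  WHERE  TABLE_TYPE = 'BASE TABLE'
    AND  TABLE_SCHEMA NOT IN ('mysql', 'information_schema',
                              'performance_schema', 'sys');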

 

Fig 2: Configuration for the Lookup activity dataset

 

Next, we can bring another activity onto the canvas: the ForEach activity. Connect the “success” arrow of the Lookup activity to the ForEach activity.

Configure the settings of the ForEach activity as follows:

Fig 3: Items setting configuration in the ForEach activity
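The Items field points at the output array of the Lookup activity. Assuming the Lookup activity kept its default name of “Lookup1” (an assumption here), the expression is:

  @activity('Lookup1').output.value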

Now add a Copy activity inside the ForEach activity. The configuration of the Copy activity is described below:

For source:

  • Create a new linked service, configure it to point at the source SQL server, and parameterize the database name in the linked service.
  • Now, create a new dataset on top of this linked service with two parameters, and use these parameters in place of the database name (the linked service parameter) and the table name.

This can be easily understood with the pictorial representation below:

Fig 4: Configuration of the linked service for the source (Data_validation_5_7_source)

Fig 5: Configuration of the dataset linked to the source, with parameters
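As a sketch of the parameterization, assuming the two dataset parameters are named database_name and table_name, the dataset’s connection properties reference them as:

  Database: @dataset().database_name
  Table:    @dataset().table_name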

Now, in the settings, let’s select our newly configured dataset as the source dataset and supply its parameters from item(), the current element of the Lookup activity’s output that the ForEach activity iterates over.

Fig 6: Configuration for the source in the Copy activity
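Assuming the column aliases from the Lookup query sketched earlier, the two dataset parameters are supplied from the current ForEach item as:

  database_name: @item().database_name
  table_name:    @item().table_name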

 

 

One final thing to note here is that we should use the Query option so that the row count query is built dynamically.

 

Fig 7: Pipeline expression for “Query” to be used in the source setting
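As a sketch of such a dynamic query, using ADF’s string interpolation over the current item (the column aliases are the same assumptions as before):

  SELECT '@{item().database_name}' AS database_name,
         '@{item().table_name}'    AS table_name,
         COUNT(*)                  AS row_count
  FROM   `@{item().database_name}`.`@{item().table_name}`;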

 

 

 

The last step is to create the configuration for the sink. We can do this the same way as for the source; if we want to store the row counts along with the database and table names in a separate local SQL server, we just need to create a linked service addressing that SQL server and a sink dataset pointing to that particular table. This step is depicted below:

Fig 8: Configuration for the sink setting in the Copy activity
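For reference, a minimal sketch of the sink table such a pipeline could write to (the table and column names are assumptions matching the query above):

  CREATE TABLE row_count_validation (
      database_name VARCHAR(128),
      table_name    VARCHAR(128),
      row_count     BIGINT
  );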

 

 

Hence, our final ADF pipeline to get the row count from the source is ready. Similarly, we can create one more pipeline for the destination; its sink can be a new table in the local SQL server with the same columns as mentioned in Fig 7.

 

Our final pipeline for both the source and the destination would look exactly like Fig 1.

 

 

Checksum Pipeline:

 

The checksum is a crucial part of data validation, and we can build its pipeline by following the same steps described above, line by line.

 

The only difference is in the Query option of the Copy activity, where we compute a checksum for each database table returned by the Lookup activity.

 

The change is shown in the picture below:

Fig 9: Settings for the source

Fig 10: Configuration for the query setting in the source option of the Copy activity
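As a sketch of what the checksum query can look like, one option on MySQL is the built-in CHECKSUM TABLE statement, interpolated per item as before (whether your connector accepts a non-SELECT statement as a source query is worth verifying):

  -- Returns one row with columns Table and Checksum;
  -- comparing the Checksum values between source and destination
  -- validates that the table contents match.
  CHECKSUM TABLE `@{item().database_name}`.`@{item().table_name}`;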

 

 

 

Conclusion:

This is one of the ways to automate the process of finding the row count and checksum for data validation.

I hope this blog gave you the relevant guidance to create and customize your own pipeline for data validation in ADF.

 

