– MANAN CHOUDHARY
ADF is one the most significant tool used in the industry to perform ETL operations.
One of the most prominent areas while performing data migration from source to destination requires a full-proof check in terms of data being transferred correctly.
In this blog, I will be sharing the ways to perform a data validation using two major metrics, namely:-
One of the projects I worked on for a client required that required migration of the data from mysql5.6 to 5.7. I was handling the task to validate the data.
I used the ADF approach to build a pipeline that checks the row count & checksum for all tables migrated from source to destination.
ADF is an ETL used to perform the required transformation, joins, or operations required based on the business logic. The client wanted to make sure the data integrity is maintained and during data migration, it is very vital to make sure the data is consistent in both source and destination.
The blog contains the step-by-step process to create an ADF pipeline to validate the data.
Now, let’s start with the formation to get the row count for both source and destination.
Fig:1 Entire Pipeline for Row Count
We can start by dragging Lookup Activity to the canvas in ADF and configure the same with the script of SQL containing the list of table names and database names. We can achieve the same using the metadata table available in your SQL server.
Note: Do not provide the name of the table in your dataset, until and unless you have created a metadata table explicitly by yourself.
Fig2: Configuration for lookup-activity dataset
Next, we can bring another activity to the canvas , named-“For each ” activity. Connect the “success” arrow of lookup to the For-each activity.
Configure the setting of For-each activity as :-
Fig 3: Items Setting Configuration in Look-up Activity
Now add “Copy” activity inside the for-each activity. The configuration of the copy activity is described below:-
For source:
This is be easily understood with the pictorial representation below:-
Fig 4: Configuration for linked service for source-(Data_validation_5_7_source)
Fig 5: Configuration for dataset linked to the source with parameters)
Now let’s select the source dataset option as our newly configured dataset in the settings, and provide the parameters for it as item() linked from output value of lookup activity.
Fig 6 : Configuration for source in Copy Activity
One final thing to note here is that we shall use query option to make it a dynamic query for the row count.
Fig 7: Pipeline expression for “Query” to be used in source setting
The last step left is to create the configuration for the sink, we can do the same as we did for the source, but if we want to copy the row counts of data along with database name and table names in a separate local SQL server then we just need to create a linked service addressing that SQL server and creating sink dataset aiming to that particular table. This step is depicted below as:-
Fig 8 : Configuration for sink setting in copy activity
Hence, our final ADF pipeline to get the row count for the source is ready. Similarly, we can create one more pipeline for the destination, and sink could be in a local sql server with a new table with the same name of columns as mentioned in Fig-7.
Our final pipeline for both source and destination would look exactly as Fig-1.
Checksum Pipeline:-
The checksum is a very crucial part of data validation, since, we can also build the pipeline for the same using the above steps mentioned line by line.
The only difference is in the query option in the copy-data activity, where we perform checksum for each database table mentioned in the lookup activity.
The change is shown in the picture below:-
Fig 9 : Settings for the source
Fig 10 : Configuration for query setting in source option of copy activity
This is one of the ways to automate the process of finding the row count and checksum for data validation.
Hope this blog helped you find the relevant guidance to customize and create your pipeline for data validation in ADF.
manan.choudhary