Understanding Data Source Integration in Azure Data Factory


Data source integration involves the seamless transfer of data from one system to another. This process ensures that data is efficiently and securely moved, maintaining its integrity and usability.

Below is an overview of the different categories of data sources supported by Azure Data Factory (ADF):

  • On-Premises Data Sources: These are data sources located within an organization’s local infrastructure. ADF allows connectivity to a variety of on-premises databases and file systems, including SQL Server, Oracle, Teradata, IBM DB2, MySQL, PostgreSQL, and file systems such as local files and network file shares.
  • Cloud Data Sources: These sources are hosted on cloud platforms, providing flexibility and scalability for data storage and processing. ADF supports a range of cloud data services such as Azure Blob Storage, Azure Data Lake Storage (Gen1 and Gen2), Azure SQL Database, Azure Cosmos DB, Azure Synapse Analytics, Amazon S3, and Google Cloud Storage.
  • SaaS Data Sources: Software as a Service applications are cloud-based software solutions accessed via the internet. ADF provides connectors for various SaaS platforms, making it easy to ingest data from sources such as Salesforce, Dynamics 365, ServiceNow, Zendesk, and Adobe Analytics.
  • Other Data Sources: Other data sources in Azure Data Factory include REST APIs, ODBC and OLE DB sources, and various NoSQL databases like MongoDB. ADF enables seamless integration with these sources, allowing organizations to retrieve and send data from diverse platforms. This flexibility supports real-time analytics, legacy system connections, and the handling of unstructured data.

In our scenario, the source is an on-premises SQL Server database, and the destination is a cloud platform, namely Azure SQL Database.

During the copy activity, the source dataset and the Integration Runtime must be specified. For on-premises data, the Self-hosted Integration Runtime (SHIR) should be selected, and the table to be migrated from the SQL database to the Azure cloud must be chosen.

The Self-Hosted Integration Runtime in Azure Data Factory serves as a secure bridge for data transfer between cloud and on-premises environments. It needs to be installed on an on-premises machine or a virtual machine within a private network, ensuring that data can be securely transferred without exposing internal resources directly to the internet.
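
As a rough sketch of how this fits together, the linked service that connects ADF to the on-premises SQL Server references the self-hosted Integration Runtime through its connectVia property. The definition below is written as a Python dict that mirrors ADF's JSON code view; the linked service name, IR name, server, and database are all placeholders rather than values from this scenario.

# Hypothetical linked service for the on-premises SQL Server, routed through a
# self-hosted Integration Runtime. Property names follow ADF's JSON definitions.
import json

on_prem_sql_linked_service = {
    "name": "OnPremSqlServerLinkedService",      # placeholder name
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            # placeholder connection string; in practice the credentials would
            # come from Azure Key Vault rather than being written inline
            "connectionString": "Server=ONPREM01;Database=HR;Integrated Security=True;"
        },
        "connectVia": {
            "referenceName": "SelfHostedIR",     # the self-hosted Integration Runtime
            "type": "IntegrationRuntimeReference"
        }
    }
}

print(json.dumps(on_prem_sql_linked_service, indent=2))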

In this example, the Employees table is selected as the source for migration.

Source Dataset: The data source in this scenario is a SQL Server table named ‘Employees’. It includes employee details such as EmployeeID, Firstname, Lastname, Position, and Salary.
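
A minimal sketch of the source dataset, again written as a Python dict in the shape of ADF's JSON definition. It assumes the placeholder linked service from the earlier sketch and a dbo schema; the dataset name is illustrative.

# Hypothetical source dataset pointing at the on-premises Employees table.
employees_source_dataset = {
    "name": "EmployeesSource",                   # placeholder dataset name
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "OnPremSqlServerLinkedService",   # placeholder from the earlier sketch
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",                     # assumed schema
            "table": "Employees"
        }
    }
}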

Use Query: In this example, the Table option is selected under Use Query. Besides this, there are other options available (see the sketch after this list):

  • Query: To specify which columns to select or to apply filters, choose the Query option under Use Query.
  • Stored Procedure: For using a stored procedure, select the Stored Procedure option and provide its name.
    These options allow for flexible data selection and filtering during the copy activity.
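
The three Use Query choices map to different properties on the copy activity's source. The sketch below shows roughly how each variant looks; the query text, column filter, and stored procedure name are invented for illustration.

# Table: the source reads the whole table referenced by the dataset.
table_source = {"type": "SqlServerSource"}

# Query: sqlReaderQuery limits the columns and rows that are copied.
query_source = {
    "type": "SqlServerSource",
    "sqlReaderQuery": "SELECT EmployeeID, Firstname, Lastname FROM dbo.Employees WHERE Salary > 50000"
}

# Stored procedure: the procedure name (and any parameters) are supplied instead.
stored_proc_source = {
    "type": "SqlServerSource",
    "sqlReaderStoredProcedureName": "usp_GetEmployees",    # hypothetical procedure
    "storedProcedureParameters": {"MinSalary": {"type": "Int32", "value": 50000}}
}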

Isolation Level: The isolation level for the copy activity can significantly impact performance and data consistency. The isolation level determines how data is read from the source system during the copy process.

  • Read Uncommitted: This level provides the fastest performance as it allows reading data that might not yet be committed. However, it can lead to dirty reads, where uncommitted changes are read.
  • Read Committed: This level ensures that only committed data is read, preventing dirty reads. It is a balance between performance and data consistency.
  • Repeatable Read: This level ensures that if a row is read twice in the same transaction, it will not change. It prevents non-repeatable reads but can lead to higher locking and reduced performance.
  • Serializable: This is the strictest level, ensuring complete isolation from other transactions. It prevents dirty reads, non-repeatable reads, and phantom reads but can significantly impact performance due to extensive locking.
  • Snapshot: This level ensures consistent reads by maintaining a version of the data as it existed at the start of the transaction. It minimizes the impact of concurrent transactions and reduces locking issues.
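
In the copy activity's JSON, the level chosen above surfaces as an isolationLevel setting on the source. A minimal sketch, with Read Committed picked purely as an example:

# Hypothetical source settings with an explicit isolation level.
source_with_isolation = {
    "type": "SqlServerSource",
    "isolationLevel": "ReadCommitted"   # or ReadUncommitted, RepeatableRead, Serializable, Snapshot
}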

To specify the destination, create the sink dataset for the Azure SQL Database.
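
A sketch of that sink dataset, following the same pattern as the source dataset; the Azure SQL Database linked service is assumed to exist already, and all names are placeholders.

# Hypothetical sink dataset for the Azure SQL Database destination.
employees_sink_dataset = {
    "name": "EmployeesSink",                     # placeholder dataset name
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabaseLinkedService",   # assumed to exist
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "Employees"
        }
    }
}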

Write behavior: In this example, the write behavior is set to insert.

The write behavior can be set to insert, upsert, or stored procedure depending on the scenario; the insert and upsert variants are sketched after the list below.

  • Insert: This write behavior adds new records to the destination table. If the table already contains data, the new records are appended without altering the existing data.
  • Upsert: This combines the operations of updating and inserting. If a record already exists in the destination table (based on a specified key), it is updated. If it does not exist, a new record is inserted. This is useful for maintaining data consistency.
  • Stored Procedure: This option allows the use of a stored procedure to handle the data insertion. The stored procedure can include custom logic for inserting, updating, or transforming data as needed.
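
Expressed in the copy activity's sink settings, the insert and upsert behaviors look roughly like the sketch below; EmployeeID is assumed to be the key used for matching.

# Insert: append rows to the destination table.
insert_sink = {"type": "AzureSqlSink", "writeBehavior": "insert"}

# Upsert: update rows that match the key, insert the rest.
upsert_sink = {
    "type": "AzureSqlSink",
    "writeBehavior": "upsert",
    "upsertSettings": {
        "useTempDB": True,          # stage rows in tempdb before merging (assumed setting)
        "keys": ["EmployeeID"]      # assumed business key
    }
}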

Table option: For the table behavior, there are two options available; in this example, the auto-create table option is used (see the sketch after this list).

  • Use Existing: This option uses an existing table in the destination database.
  • Auto Create Table: This option automatically creates the table in the destination database if it does not already exist.
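
On the sink, this choice appears as the tableOption property; when set to autoCreate, ADF creates the destination table from the source schema if it is missing. A minimal sketch:

# Hypothetical sink settings that let ADF create dbo.Employees if it does not exist.
auto_create_sink = {
    "type": "AzureSqlSink",
    "writeBehavior": "insert",
    "tableOption": "autoCreate"
}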

Pre-copy script: A pre-copy script is a SQL statement that runs against the destination before the copy begins. It can be used for tasks like truncating a table or setting up conditions.
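
A common use is emptying the destination table before a full reload. A sketch, assuming the Employees table from this scenario:

# Hypothetical sink with a pre-copy script that truncates the table before loading.
sink_with_precopy = {
    "type": "AzureSqlSink",
    "writeBehavior": "insert",
    "preCopyScript": "TRUNCATE TABLE dbo.Employees"
}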

Pipeline Execution: The pipeline is executed to migrate the data from the on-premises SQL Server to the Azure SQL database.
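
Putting the pieces together, a minimal copy activity that moves the Employees table from the on-premises dataset to the Azure SQL dataset might look roughly like this; every name reuses the placeholders from the earlier sketches and is not taken from a real pipeline.

# Hypothetical pipeline with a single copy activity tying the earlier sketches together.
copy_pipeline = {
    "name": "CopyEmployeesToAzureSql",           # placeholder pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyEmployees",
                "type": "Copy",
                "inputs":  [{"referenceName": "EmployeesSource", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "EmployeesSink",   "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlServerSource", "isolationLevel": "ReadCommitted"},
                    "sink": {
                        "type": "AzureSqlSink",
                        "writeBehavior": "insert",
                        "tableOption": "autoCreate"
                    }
                }
            }
        ]
    }
}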

Conclusion

Azure Data Factory provides extensive support for a wide array of data sources, including on-premises databases, cloud storage solutions, SaaS applications, and various other data technologies. This versatility allows organizations to seamlessly integrate and ingest data from diverse environments, facilitating comprehensive data pipelines. By leveraging ADF’s capabilities, businesses can enhance their analytics and reporting processes, drive informed decision-making, and optimize their overall data management strategies. Ultimately, ADF empowers organizations to unify their data landscape, enabling deeper insights and greater operational efficiency.

Chandana R L
