Azure Data Factory (ADF) is a cloud-based data integration service that enables users to create, schedule, and orchestrate data workflows. ADF offers two main types of data flows: Mapping Data Flow and Wrangling Data Flow.
Data Flow is a data transformation tool that lets users design, build, and execute transformation logic through a drag-and-drop interface. It is built on Apache Spark, a high-performance, scalable data processing engine, so it can process large volumes of data quickly and handle complex transformation tasks.
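Because a mapping data flow is compiled into a Spark job behind the scenes, the transformations you design visually behave like ordinary Spark operations. The following is only a rough conceptual sketch, not ADF's generated code; the file paths and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-flow-concept").getOrCreate()

# Source: read a dataset (illustrative path; schema inferred from the CSV).
orders = (
    spark.read.option("header", True)
         .option("inferSchema", True)
         .csv("/data/orders.csv")
)

# Typical visual transformations -- a filter followed by an aggregate --
# expressed directly in Spark.
summary = (
    orders.filter(F.col("status") == "shipped")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount"))
)

# Sink: write the result out.
summary.write.mode("overwrite").parquet("/data/order_summary")
```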
ADF data flow capabilities
ADF provides various data flow capabilities to handle different data transformations. Each flow type has its own key features and limitations, which are compared below.
Comparing Mapping and Wrangling data flows
Mapping data flow: Primarily used for transforming data with well-defined schemas and structures.
Wrangling data flow: Designed for data preparation and exploratory data analysis; useful for cleansing, reshaping, and transforming data.
Mapping data flow: Provides a visual designer with a drag-and-drop interface; you connect components such as sources, transformations, and sinks to create a data pipeline.
Wrangling data flow: Based on the Power Query experience familiar to users of tools like Power BI, offering a more exploratory and interactive approach to data preparation.
Mapping data flow: Handles schema drift automatically, making it suitable for scenarios where source schemas change frequently.
Wrangling data flow: Requires manual updates for schema changes, so it is better suited to cases where schema drift is infrequent.
Mapping data flow: Leverages Apache Spark for high performance and scalability, allowing large volumes of data to be processed efficiently.
Wrangling data flow: Uses the M query language for data transformations, providing a robust environment for data manipulation but with less emphasis on the large-scale performance optimizations found in Mapping data flows.
MAPPING DATA FLOW STEPS
Step 1: In the Azure Data Factory pane, create a new mapping data flow.
Step 2: To add sources, create a dataset and configure the linked service. Specify the table names directly when creating the dataset or use parameters to pass them dynamically.
In the following example, two tables, Employees and Projects, are used. Two sources are added in the data flow, one for each table.
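For teams that script their factory setup, these datasets can also be created with the azure-mgmt-datafactory Python SDK. Below is a minimal sketch, assuming an existing Azure SQL linked service; the subscription, resource group, factory, and linked-service names are placeholders, and model names can vary between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlTableDataset,
    DatasetResource,
    LinkedServiceReference,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Reference the linked service that points at the source database (placeholder name).
ls_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name="AzureSqlLinkedService"
)

# One dataset per source table used in the data flow.
for table in ("Employees", "Projects"):
    dataset = DatasetResource(
        properties=AzureSqlTableDataset(linked_service_name=ls_ref, table_name=table)
    )
    client.datasets.create_or_update("my-rg", "my-adf", f"{table}Dataset", dataset)
```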
Step 3: To perform a transformation on a specific table, click the ‘+’ sign next to that table and select the desired transformation.
In this example, a left join is performed on the Employees and Projects tables.
Step 4: Add a sink to load the transformed data into the final table.
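Conceptually, the join and sink steps above correspond to the PySpark below; ADF generates and runs the actual Spark job for you, so this is only an illustration (the paths and the join key "EmployeeID" are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapping-data-flow-sketch").getOrCreate()

# Sources: one DataFrame per input table (illustrative locations).
employees = spark.read.parquet("/data/employees")
projects = spark.read.parquet("/data/projects")

# Transformation: left join Employees to Projects on an assumed shared key.
joined = employees.join(projects, on="EmployeeID", how="left")

# Sink: load the transformed data into the final table.
joined.write.mode("overwrite").parquet("/data/employee_projects")
```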
Step 5: To execute this mapping data flow, navigate to the author section in ADF and create a pipeline. Within the pipeline, add a Data Flow activity and select the data flow you created. This activity can be used in conjunction with other activities in ADF.
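The pipeline can also be triggered programmatically rather than from the portal. A minimal sketch with the azure-mgmt-datafactory SDK; the pipeline name "MappingDataFlowPipeline" and the resource names are placeholder assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Trigger a run of the pipeline that contains the data flow activity.
run = client.pipelines.create_run(
    "my-rg", "my-adf", "MappingDataFlowPipeline", parameters={}
)
print(f"Started pipeline run: {run.run_id}")
```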
WRANGLING DATA FLOW STEPS
Step 1: In the Azure portal, open the data factory resource. Click the author option and select Power Query from the available options to create a new wrangling data flow.
Step 2: Specify the source dataset for the query. Supported sources include Azure Blob Storage, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Synapse Analytics.
In the example shown in the screenshot, Azure Data Lake Storage Gen2 is selected as the source dataset, with a sample CSV file chosen.
Step 3: Perform transformations such as removing unnecessary rows and columns, keeping only the required rows, grouping data, splitting columns, removing duplicate records, merging queries, appending queries, and more.
In this example, the “Strengths” column is removed from the source dataset.
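At runtime, ADF translates the Power Query M script into Spark. Conceptually, this removal step is equivalent to the PySpark below (the file path is an assumption; the column name comes from the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangling-data-flow-sketch").getOrCreate()

# Source: the sample CSV from Azure Data Lake Storage Gen2 (illustrative path).
df = spark.read.option("header", True).csv("/data/sample.csv")

# Transformation: drop the "Strengths" column from the dataset.
df = df.drop("Strengths")

df.show()
```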
Step 4: After completing the transformation, execute the Power Query using a Power Query activity. Navigate to the data pipeline menu by clicking the pipeline icon, then select the created Power Query in the Power Query activity settings.
Step 5: When the pipeline runs, the transformed data is loaded into the destination table.
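Whether the pipeline contains a data flow activity or a Power Query activity, its status can be polled with the same SDK. A minimal sketch, assuming the run_id returned by the earlier create_run example:

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Poll the run until it leaves the queued/in-progress states.
run_id = "<run-id-from-create_run>"
while True:
    run = client.pipeline_runs.get("my-rg", "my-adf", run_id)
    if run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print(f"Pipeline run finished with status: {run.status}")
```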
Conclusion
Azure Data Factory’s Data Flows feature makes it easy to create data transformation pipelines. One of its key benefits is simplicity: users can build pipelines without writing complicated code. Data flows can handle a wide range of tasks, from cleaning and filtering data to performing complex aggregations and joins, making them a great fit for organizations that work with large datasets.
Overall, Data Flows is a valuable addition to any data integration or ETL workflow. It helps organizations format their data for analysis and reporting. With its user-friendly design and support for various transformations, Data Flows can simplify data processing and improve the accuracy and reliability of the results.
Chandana R L