Data Flow in Azure Data Factory: A Deep Dive into Mapping and Wrangling Data Flows

Azure Data Factory (ADF) is a cloud-based data integration service that enables users to create, schedule, and orchestrate data workflows. ADF offers two main types of data flows: Mapping Data Flow and Wrangling Data Flow.

Data Flow is a data transformation tool that allows users to design, build, and execute data transformation logic using a drag-and-drop interface. Built on Apache Spark, a high-performance, scalable data processing engine, it enables you to process large volumes of data quickly. This integration with Spark ensures both speed and scalability for complex data transformation tasks.

ADF data flow capabilities
ADF provides various data flow capabilities to handle different data transformations.

  1. Mapping data flow
    Mapping Data Flows in Azure Data Factory help you handle complex data transformations. They offer a visual interface where you can create and manage these transformations using a series of activities. With a drag-and-drop interface, you can design, build, and execute data transformations easily. The data flow pipeline is made up of components like sources, transformations, and sinks, which are used to extract, transform, and load data.

          Key Features

  • Data Transformation Activities: Supports a wide range of activities including filtering, aggregating, joining, sorting, and pivoting data.
  • Scalability: Data flows are executed on scalable, Spark-based compute clusters.
  • Debugging: Allows you to step through the data flow pipeline and inspect the data at each stage, which is valuable for identifying and resolving issues before executing the pipeline.
  • Expressions: Expressions in Mapping Data Flow allow users to create dynamic data transformation logic that can be applied to data as it flows through the pipeline.
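The source → transformations → sink pattern described above can be sketched as a local analogue in plain Python. This is not ADF or Spark code, and the rows, column names, and values are invented for illustration; it only mirrors the shape of a Mapping Data Flow pipeline:

```python
# Local analogue of a Mapping Data Flow: source -> transformations -> sink.
# All data and field names here are invented sample values.

source_rows = [
    {"name": "Asha", "dept": "HR", "salary": 50000},
    {"name": "Ravi", "dept": "IT", "salary": 70000},
    {"name": "Meena", "dept": "IT", "salary": 65000},
]

# Filter transformation: keep only IT employees.
filtered = [r for r in source_rows if r["dept"] == "IT"]

# Aggregate transformation: average salary for the filtered rows.
avg_salary = sum(r["salary"] for r in filtered) / len(filtered)

# Sink: collect the transformed output.
sink = {"dept": "IT", "avg_salary": avg_salary}
print(sink)  # {'dept': 'IT', 'avg_salary': 67500.0}
```

In ADF itself, each of these steps would be a visual component (a source, a Filter transformation, an Aggregate transformation, and a sink) rather than code.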

          Limitations

  • Performance Bottlenecks: Potential bottlenecks include reading from a source, transformation time, and writing to a sink.
  • Complex Transformations: Handling complex transformations might require additional optimization and tuning.
  • Limited Support for Certain Fields: Mapping to polymorphic lookup fields, multi-level lookup fields, and certain status fields is not supported.
  2. Wrangling data flow
    Wrangling Data Flows are designed for data preparation and transformation tasks that involve cleansing and reshaping data. They are based on the Power Query experience, which is familiar to users who have worked with data preparation tools like Power BI.
    Wrangling Data Flows use the M query language and the Power Query Editor interface from Power BI Desktop.

          Key Features

  • Interactive Interface: Provides an interactive, user-friendly interface for data wrangling, like Power Query in Power BI.
  • Data Transformation Functions: Includes a broad set of data transformation functions such as text manipulation, date/time operations, and data type conversions.
  • Exploratory Data Preparation: Perfect for tasks that require a manual, hands-on approach to exploring and preparing data.
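As a rough local analogue of the function categories listed above, here is what text manipulation, a date/time operation, and a data type conversion look like in plain Python (not M; the sample values are invented):

```python
from datetime import datetime

# Text manipulation: trim whitespace and normalize case.
raw_name = "  alice SMITH "
clean_name = raw_name.strip().title()

# Date/time operation: parse a date string and extract the year.
order_date = datetime.strptime("2024-09-22", "%Y-%m-%d")
order_year = order_date.year

# Data type conversion: string to number.
amount = float("1234.50")

print(clean_name, order_year, amount)  # Alice Smith 2024 1234.5
```

In a Wrangling Data Flow, the equivalent steps would be authored interactively in the Power Query Editor and recorded as M expressions.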

          Limitations

  • Scalability and Performance: Handling large datasets can slow down the wrangling process, requiring careful planning and optimization to ensure efficiency and scalability.
  • Handling Unstructured and Semi-Structured Data: Processing unstructured data (like text and images) and semi-structured data (like JSON) adds complexity, needing specialized tools and techniques.

Comparing Mapping and Wrangling data flows

  1. Purpose

Mapping data flow: Primarily used for transforming data with well-defined schemas and structures.
Wrangling data flow: Designed for data preparation and exploratory data analysis. Useful for cleansing, reshaping, and transforming data.

  2. Design Interface

Mapping data flow: Visual designer, utilizes a drag-and-drop interface to design transformations. You can connect various components like sources, transformations, and sinks to create a data pipeline.
Wrangling data flow: Based on the Power Query experience, familiar to users of tools like Power BI. Provides a more exploratory and interactive approach to data preparation.

  3. Schema drift

Mapping data flow: Handles schema drift automatically, making it suitable for scenarios where source schemas change frequently.
Wrangling data flow: Requires manual updates for schema changes, which is more appropriate when schema drift is infrequent.
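The schema drift idea can be illustrated with a simplified local sketch in plain Python (the column names and rows are invented): drift-tolerant logic derives the column set from the data that actually arrives, instead of relying on a hard-coded schema.

```python
# Simplified illustration of schema drift handling.
# Batch 2 gains a new "email" column that the original schema lacked.
batch1 = [{"id": 1, "name": "Asha"}]
batch2 = [{"id": 2, "name": "Ravi", "email": "ravi@example.com"}]

def union_columns(batches):
    """Drift-tolerant: discover the column set from the data itself."""
    cols = []
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in cols:
                    cols.append(col)
    return cols

cols = union_columns([batch1, batch2])
print(cols)  # ['id', 'name', 'email']

# Normalize every row to the full column set, filling gaps with None.
rows = [{c: row.get(c) for c in cols} for batch in (batch1, batch2) for row in batch]
```

A Mapping Data Flow with schema drift enabled does this kind of column discovery for you; in a Wrangling Data Flow, the query steps would have to be edited by hand to pick up the new column.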

  4. Performance

Mapping data flow: Leverages Apache Spark for high performance and scalability, allowing for processing of large volumes of data efficiently.
Wrangling data flow: Uses the M query language for data transformations, providing a robust environment for data manipulation, but it is less focused on the large-scale performance optimizations found in Mapping data flows.

  5. Analyzing the icons

Mapping data flow Icon: Represents the process of transforming raw data into structured forms, reflecting its role in complex data transformations.
Wrangling data flows Icon: Represents the process of preparing and reshaping datasets, highlighting its role in data exploration and cleanup.

MAPPING DATA FLOW STEPS

Step 1: In the Azure Data Factory pane, create a new mapping data flow.

Step 2: To add sources, create a dataset and configure the linked service. Specify the table names directly when creating the dataset or use parameters to pass them dynamically.
In the following example, two tables, Employees and Projects, are used. Two sources are added in the data flow, one for each table.

Step 3: To perform a transformation on a specific table, click the ‘+’ sign next to that table. This will allow you to select the desired transformation.
In this example, a left join is performed on the Employees and Projects tables.
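The left join in this step behaves like the following plain-Python sketch (the rows and the join key `emp_id` are invented sample data; in ADF you would configure the same condition in the Join transformation's settings rather than write code):

```python
# Left join: every employee is kept; matching projects are attached where present.
employees = [
    {"emp_id": 1, "name": "Asha"},
    {"emp_id": 2, "name": "Ravi"},
]
projects = [
    {"emp_id": 1, "project": "Migration"},
]

joined = []
for emp in employees:
    matches = [p for p in projects if p["emp_id"] == emp["emp_id"]]
    if matches:
        for p in matches:
            joined.append({**emp, "project": p["project"]})
    else:
        # A left join keeps unmatched left-side rows, with a null right side.
        joined.append({**emp, "project": None})

print(joined)
# [{'emp_id': 1, 'name': 'Asha', 'project': 'Migration'},
#  {'emp_id': 2, 'name': 'Ravi', 'project': None}]
```

Note that Ravi survives the join even though he has no project, which is exactly the behavior a left join gives you over an inner join.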

Step 4: Add a sink to load the transformed data into the final table.

Step 5: To execute this mapping data flow, navigate to the author section in ADF and create a pipeline. Within the pipeline, use the data flow activity and select the created data flow to execute it. This data flow activity can be used in conjunction with other activities in ADF.
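Under the hood, the pipeline created in this step is stored as JSON, with the data flow referenced from an ExecuteDataFlow activity. An abbreviated sketch of that definition, shown here as a Python dict (the names MyPipeline, RunMappingDataFlow, and MyDataFlow are placeholders, and most properties are omitted):

```python
# Abbreviated sketch of a pipeline definition that runs a mapping data flow.
# Names are placeholders; a real definition carries many more properties.
pipeline = {
    "name": "MyPipeline",
    "properties": {
        "activities": [
            {
                "name": "RunMappingDataFlow",
                "type": "ExecuteDataFlow",  # the Data flow activity
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "MyDataFlow",
                        "type": "DataFlowReference",
                    }
                },
            }
        ]
    },
}

activity = pipeline["properties"]["activities"][0]
print(activity["type"])  # ExecuteDataFlow
```

You normally never write this JSON by hand; the ADF authoring UI generates it when you drop the Data flow activity onto the pipeline canvas.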

WRANGLING DATA FLOW STEPS

Step 1: In the Azure portal, open the Data Factory resource. Click the Author option and select Power Query from the available options to create a new wrangling data flow.

Step 2: Specify the source dataset for the query. Supported sources include Azure Blob Storage, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Synapse Analytics.
In this example, Azure Data Lake Storage Gen2 is selected as the source dataset, with a sample CSV file chosen.

Step 3: Perform transformations such as removing unnecessary rows and columns, keeping only the required rows, grouping data, splitting columns, removing duplicate records, merging queries, appending queries, and more.

In this example, the “Strengths” column is removed from the source dataset.
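A plain-Python analogue of this step is shown below (the rows and the “Strengths” column are invented sample data; in ADF the same operations are the Remove columns and Remove duplicates steps in the Power Query editor):

```python
# Sample rows standing in for the CSV source; all values are invented.
rows = [
    {"name": "Asha", "role": "Analyst", "Strengths": "SQL"},
    {"name": "Ravi", "role": "Engineer", "Strengths": "Spark"},
    {"name": "Asha", "role": "Analyst", "Strengths": "SQL"},  # duplicate record
]

# Remove the "Strengths" column from every row.
without_strengths = [{k: v for k, v in r.items() if k != "Strengths"} for r in rows]

# Remove duplicate records, another common wrangling step.
deduped = []
for r in without_strengths:
    if r not in deduped:
        deduped.append(r)

print(deduped)
# [{'name': 'Asha', 'role': 'Analyst'}, {'name': 'Ravi', 'role': 'Engineer'}]
```

In the Power Query editor each of these operations is recorded as an M step, so the whole transformation replays automatically on every pipeline run.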

Step 4: After completing the transformation, execute the Power Query using a Power Query activity in a pipeline. Navigate to the data pipeline menu by clicking the pipeline icon, then select the created Power Query in the Power Query activity's settings.

Loading data into a table

Conclusion

Azure Data Factory’s Data Flows feature makes it easy to create data transformation pipelines. One of the key benefits is its simplicity: users can build pipelines without needing to write complicated code. Data Flows can handle a wide range of tasks, from cleaning and filtering data to performing complex aggregations and joins, making it a great tool for organizations that need to work with large datasets.

Overall, Data Flows is a valuable addition to any data integration or ETL workflow. It helps organizations format their data for analysis and reporting. With its user-friendly design and support for various transformations, Data Flows can simplify data processing and improve the accuracy and reliability of the results.


Chandana R L
