Unlocking Big Data Power: Apache Spark and Microsoft Fabric for Scalable Data Processing


Apache Spark is an open-source parallel processing framework for large-scale data processing and analytics. It has become popular in big data scenarios and is available in multiple platform implementations, including Microsoft Fabric. Spark enables large-scale analytics by coordinating work across multiple processing nodes in a cluster, known in Microsoft Fabric as a Spark pool. Put more simply, Spark uses a "divide and conquer" approach: it processes large volumes of data quickly by distributing the work across multiple computers, and it handles the distribution of tasks and the combination of results for us.

The use of Spark in Microsoft Fabric is fully managed and abstracted, meaning that the complexity of spinning up a Spark instance is handled behind the scenes. We can control some of the configurations and settings, but most of the hard work is done by Fabric. Spark compute powers the Data Engineering and Data Science experiences in Fabric, and it is used whenever we run a notebook or a Spark Job Definition.

Spark can run code written in a wide range of languages, including Java, Scala, R, SQL (Spark SQL), and Python (PySpark). In practice, most data engineering and analytics workloads are accomplished using a combination of PySpark and Spark SQL, as in the sketch below.
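The following is a minimal sketch of how the two are typically combined in a Fabric notebook. The file path is hypothetical, and the spark session object is provided automatically by Fabric notebooks.

    # PySpark: read a CSV file from the attached lakehouse into a DataFrame.
    # "Files/sales.csv" is a hypothetical path.
    df = spark.read.format("csv").option("header", "true").load("Files/sales.csv")

    # Register the DataFrame as a temporary view so it can be queried with SQL.
    df.createOrReplaceTempView("sales")

    # Spark SQL: aggregate the same data declaratively.
    result = spark.sql("""
        SELECT Category, COUNT(*) AS Orders
        FROM sales
        GROUP BY Category
    """)
    result.show()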

Spark pools

A Spark pool consists of compute nodes that distribute data processing tasks across the cluster.

A Spark pool contains two kinds of node:

  1. A head node, which coordinates distributed processes through a driver program.
  2. Multiple worker nodes, on which executor processes perform the actual data processing tasks.
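
To make this division of labor concrete, the sketch below shows how the driver turns lazy transformations into parallel tasks on the executors; the Parquet path is hypothetical.

    # Transformations are lazy: the driver only builds an execution plan.
    df = spark.read.parquet("Files/events.parquet")   # no work runs yet
    errors = df.filter(df["status"] == "error")       # still lazy

    # An action makes the driver split the job into tasks, roughly one per
    # data partition, and schedule them on the executors of the worker nodes.
    print(errors.count())

    # The number of partitions sets the parallelism of this stage.
    print(df.rdd.getNumPartitions())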

Spark pools in Microsoft Fabric

Microsoft Fabric provides a starter pool in each workspace, enabling Spark jobs to be started and run quickly with minimal setup and configuration. The starter pool can be configured for a specific workload. Additionally, custom Spark pools can be created with specific node configurations that support our data processing needs.

The starter pool can be managed in the Data Engineering/Science section of the workspace settings.

Specific configuration settings for Spark pools include:

  • Node family: the type of virtual machines used for the Spark cluster nodes. In most cases, memory-optimized nodes provide optimal performance.
  • Autoscale: whether to automatically provision nodes as needed and, if so, the initial and maximum number of nodes to be allocated to the pool.
  • Dynamic allocation: whether to dynamically allocate executor processes on the worker nodes based on data volumes.

If we create one or more custom Spark pools in a workspace, one of them (or the starter pool) can be set as the default pool, to be used when a specific pool is not specified for a given Spark job.
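
Although these settings are applied through the workspace UI, some of the resulting Spark properties can be inspected from a running notebook session. A small sketch, using standard Apache Spark configuration keys:

    # Inspect the effective pool settings from a notebook session. These are
    # standard Apache Spark configuration keys; their values reflect how the
    # Fabric Spark pool was configured.
    print(spark.conf.get("spark.dynamicAllocation.enabled", "not set"))
    print(spark.conf.get("spark.dynamicAllocation.maxExecutors", "not set"))
    print(spark.conf.get("spark.executor.memory", "not set"))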

Runtimes and Environments

In Fabric's Data Engineering experience, the Spark runtime version determines which versions of Apache Spark, Delta Lake, Python, and other core components are used. A variety of libraries can be added for both general and specialized tasks. Organizations often create multiple environments to handle different data processing needs; each environment includes a specific runtime version and the necessary libraries, and data engineers and scientists can then choose the right environment for their tasks on a Spark pool.
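
A quick way to confirm which components a session is actually running is to query their versions from a notebook. A minimal sketch, assuming the Delta Lake Python package is installed under its usual PyPI name, delta-spark:

    import sys
    from importlib.metadata import version

    # Core component versions shipped with the current runtime.
    print("Spark: ", spark.version)
    print("Python:", sys.version.split()[0])
    print("Delta: ", version("delta-spark"))  # assumes the delta-spark package name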

Environments in Microsoft Fabric

Custom environments can be created in a Fabric workspace, enabling us to use specific Spark runtimes, libraries, and configuration settings for different data processing operations.

When creating an environment, we can:

  • Specify the Spark runtime it should use.
  • View the built-in libraries that are installed in every environment.
  • Install specific public libraries from the Python Package Index (PyPI).
  • Install custom libraries by uploading a package file.
  • Specify the Spark pool that the environment should use.

After creating at least one custom environment, we can specify it as the default environment in the workspace settings.
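
For quick experiments, public libraries such as those described above can also be installed at the session level rather than baked into an environment. A sketch using inline pip magic in a notebook cell; emoji is just an example PyPI package, and session-scoped installs do not persist beyond the current session:

    # Session-scoped install: available only for the current Spark session,
    # unlike environment libraries, which persist across sessions.
    %pip install emoji

    import emoji
    print(emoji.emojize("Install worked :thumbs_up:"))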

High concurrency mode

High concurrency mode enables users to share Spark sessions across data engineering and data science workloads. When using this mode, each notebook or job operates within its own isolated environment, even though they share the same Spark session. This means that multiple items can run concurrently without interfering with each other.

How does it work?

  • Security: session sharing is always contained within a single user boundary.
  • Multitasking: users can seamlessly switch between notebooks and continue their work without delays caused by session creation.
  • Cost-effectiveness: sharing sessions allows users to achieve better resource utilization and cost savings for their data engineering and data science workloads.

Session sharing conditions include:

  • Sessions should be within a single user boundary.
  • Sessions should have the same default Lakehouse configuration.
  • Sessions should have the same Spark compute properties.

CONCLUSION

Apache Spark, known for its powerful parallel processing capabilities, offers scalable data processing within Microsoft Fabric. By leveraging Spark pools, users can efficiently distribute tasks across multiple compute nodes, while the flexibility of multiple runtime versions and custom environments helps meet varied data processing needs. High concurrency mode further enhances the efficiency and cost-effectiveness of Spark operations, allowing multiple concurrent workloads to run smoothly and securely within shared Spark sessions. Whether using starter or custom Spark pools, and leveraging specific environments, Microsoft Fabric provides a robust platform for tackling complex data challenges.


Nayan Sagar N K
