A Beginner’s Guide to Spark Compute in Microsoft Fabric: Starter and Custom Pools

We have already discussed an overview of Apache Spark in Microsoft Fabric in a previous blog:
https://datasturdy.com/unlocking-big-data-power-apache-spark-and-microsoft-fabric-for-scalable-data-processing/

Spark Compute is a key component of Microsoft Fabric, the end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Spark Compute enables data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.

Spark Compute is configured through Spark pools, which are a way of telling Spark what kind of resources are needed for data analysis tasks. A pool can be given a name, and the number and size of its nodes (the machines that perform the processing) can be selected. Spark can also be configured to adjust the number of nodes dynamically based on the workload.

Spark Compute operates on OneLake, the data lake service that powers Microsoft Fabric. OneLake provides a unified location for storing and accessing all types of data, and it can also surface external sources such as Amazon S3 and Google Cloud Storage through shortcuts.
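For example, once a shortcut is in place, Spark reads it like any other OneLake path. Below is a minimal PySpark sketch, assuming a Fabric notebook with a default lakehouse attached; the shortcut name `sales_s3` and file layout are hypothetical:

```python
# In a Fabric notebook, `spark` is pre-created. With a default lakehouse
# attached, relative paths resolve against it, so a shortcut named
# "sales_s3" (hypothetical, pointing at an S3 bucket) reads like local data.
df = spark.read.parquet("Files/sales_s3/orders/")
df.show(5)
```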

An instance of Spark is initiated only when we connect to it, using one of the methods below (a minimal example follows the list):

  • Executing code in a notebook
  • Executing a Spark job definition
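For example, simply running a cell that touches Spark in a notebook is enough to spin up a session on the workspace's pool; a minimal sketch:

```python
# Running this cell in a Fabric notebook triggers a Spark session
# on the pool attached to the workspace (starter or custom).
data = [("Alice", 34), ("Bob", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
```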

How to Use Spark Compute?

There are two ways to use Spark Compute in Microsoft Fabric:

  • Starter pools
  • Custom pools

Starter Pools

A starter pool is what a workspace is associated with by default when it is assigned to a Fabric capacity. Starter pools provide a fast and easy way to use Spark on the Microsoft Fabric platform within seconds: Spark sessions can be used immediately, without waiting for nodes to be set up, which allows for quicker data processing and insights.

Starter pools have Spark clusters that are always on and ready to handle requests. They offer rapid Spark session initialization, typically within 5 to 10 seconds, with no manual setup required. They use medium nodes that scale dynamically based on the needs of the Spark job.

Default configurations and the max node limits supported for starter pools based on Microsoft Fabric capacity SKUs:

  • Node family: Memory Optimized
  • Node Size: Medium
  • Autoscale: On
  • Dynamic Allocation: On
| SKU name | Capacity units | Spark VCores | Node size | Default max nodes | Max number of nodes |
|---|---|---|---|---|---|
| F2 | 2 | 4 | Medium | 1 | 1 |
| F4 | 4 | 8 | Medium | 1 | 1 |
| F8 | 8 | 16 | Medium | 2 | 2 |
| F16 | 16 | 32 | Medium | 3 | 4 |
| F32 | 32 | 64 | Medium | 8 | 8 |
| F64 | 64 | 128 | Medium | 10 | 16 |
| (Trial Capacity) | 64 | 128 | Medium | 10 | 16 |
| F128 | 128 | 256 | Medium | 10 | 32 |
| F256 | 256 | 512 | Medium | 10 | 64 |
| F512 | 512 | 1024 | Medium | 10 | 128 |
| F1024 | 1024 | 2048 | Medium | 10 | 200 |
| F2048 | 2048 | 4096 | Medium | 10 | 200 |

Billing for starter pools is based only on active Spark session usage. There is no charge for the time Spark keeps nodes ready for use.

For example, when a notebook job is submitted to a starter pool, billing applies only during the period when the notebook session is active.
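To stop paying for an interactive session as soon as the work is done, the session can be stopped explicitly instead of being left to reach its idle timeout. A minimal sketch, assuming the `mssparkutils` helper that Fabric notebooks expose:

```python
# Stop the current Spark session explicitly so billing for it ends now
# rather than when the session idle timeout is reached.
mssparkutils.session.stop()
```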

Custom Pools

Custom Spark pools allow for setting up Spark resources based on specific needs. They offer the flexibility to choose the number of nodes, their sizes, and other configurations. This ensures the pool is tailored to handle data processing tasks efficiently. Custom Spark pools can be created for a workspace and set as the default option for other users. This approach saves time by eliminating the need to set up a new Spark pool each time a notebook or Spark job is run. Custom Spark pools typically take about three minutes to start, as Spark must retrieve the nodes from Azure.

Several aspects of the custom pool can be configured, such as the following (a creation sketch follows the list):

  • Node size: A Spark pool can be defined with node sizes that range from a small compute node (4 vCores and 28 GB of memory) to a double extra-large compute node (64 vCores and 400 GB of memory per node).

| Size | vCores | Memory |
|---|---|---|
| Small | 4 | 28 GB |
| Medium | 8 | 56 GB |
| Large | 16 | 112 GB |
| X-Large | 32 | 224 GB |
| XX-Large | 64 | 400 GB |
  • Node count: Specify the minimum and maximum number of nodes in the custom pool.
  • Autoscale: Enable autoscale to let Spark automatically adjust the number of nodes based on workload demand.
  • Dynamic allocation: Enable dynamic allocation to allow Spark to dynamically assign executors based on workload demand.
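As an illustration of these settings together, the sketch below creates a custom pool through the Fabric REST API (pools can also be created from the workspace settings UI). The workspace ID and token are placeholders, and the endpoint and payload shape follow our reading of the Fabric Custom Pools API, so treat this as a sketch rather than a definitive call:

```python
import requests

workspace_id = "<workspace-guid>"   # placeholder
token = "<entra-access-token>"      # placeholder, e.g. acquired via azure-identity

# Hypothetical pool definition mirroring the options described above.
pool = {
    "name": "DataEngineeringPool",
    "nodeFamily": "MemoryOptimized",
    "nodeSize": "Small",  # Small | Medium | Large | X-Large | XX-Large
    "autoScale": {"enabled": True, "minNodeCount": 1, "maxNodeCount": 4},
    "dynamicExecutorAllocation": {"enabled": True, "minExecutors": 1, "maxExecutors": 3},
}

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/spark/pools",
    headers={"Authorization": f"Bearer {token}"},
    json=pool,
)
resp.raise_for_status()
print(resp.json())
```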

Custom pool configurations for F64 capacity:

| Fabric capacity SKU | Capacity units | Spark VCores | Node size | Max number of nodes |
|---|---|---|---|---|
| F64 | 64 | 128 | Small | 32 |
| F64 | 64 | 128 | Medium | 16 |
| F64 | 64 | 128 | Large | 8 |
| F64 | 64 | 128 | X-Large | 4 |
| F64 | 64 | 128 | XX-Large | 2 |

Creating a custom pool is free, with charges only applied when Spark jobs run on the pool.

Conclusion

Spark Compute is a powerful and flexible tool for utilizing Spark on Microsoft Fabric. It enables a wide range of data engineering and data science tasks on data stored in OneLake or other sources. With options for creating and managing Spark pools, including starter pools for quick access and custom pools for specific configurations, it provides the versatility needed to meet diverse workloads and preferences.


Nayan Sagar N K
