We have already covered an overview of Apache Spark in Microsoft Fabric in a previous blog:
https://datasturdy.com/unlocking-big-data-power-apache-spark-and-microsoft-fabric-for-scalable-data-processing/
Spark Compute is a key component of Microsoft Fabric, the end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Spark Compute enables data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.
A Spark pool is how you tell Spark what kind of resources your data analysis tasks need. You can name a pool, choose the number and size of its nodes (the machines that perform the processing), and optionally let Spark adjust the number of nodes dynamically based on the workload.
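For instance, you can inspect how a session was provisioned from inside a running notebook. This is a minimal PySpark sketch: it assumes the `spark` session object that Fabric notebooks create automatically, and the keys shown are standard Spark settings that may be unset on some pools.

```python
# Minimal sketch: inspect the resource settings of the current Spark session.
# Assumes the `spark` SparkSession that Fabric notebooks provide automatically.
for key in (
    "spark.dynamicAllocation.enabled",      # whether executor count scales with load
    "spark.dynamicAllocation.maxExecutors",
    "spark.executor.cores",                 # vCores per executor
    "spark.executor.memory",                # memory per executor
):
    print(key, "=", spark.conf.get(key, "<not set>"))
```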
Spark Compute operates on OneLake, the data lake service that powers Microsoft Fabric. OneLake provides a unified location for storing and accessing all types of data, and it can also reach external sources such as Amazon S3 and Google Cloud Storage through Shortcuts.
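As a quick illustration, here is a hedged PySpark sketch of reading data that lives in OneLake. The table and shortcut names are hypothetical placeholders, and it assumes the notebook has a default lakehouse attached.

```python
# Read a Delta table from the notebook's default lakehouse (relative OneLake path).
orders = spark.read.format("delta").load("Tables/sales_orders")  # hypothetical table

# Read files surfaced through a OneLake shortcut (e.g., one pointing at Amazon S3).
# Shortcut data appears under Files/ just like native lakehouse files.
raw = spark.read.option("header", True).csv("Files/s3_orders_shortcut/*.csv")

orders.show(5)
```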
A Spark instance is initiated only when we connect to it, using one of the methods below.
There are two ways to use Spark Compute in Microsoft Fabric: starter pools and custom Spark pools.
A starter pool is what a workspace uses by default when it is assigned to a Fabric capacity. Starter pools provide a fast and easy way to use Spark on the Microsoft Fabric platform: Spark sessions can be used within seconds, without waiting for nodes to be provisioned, which allows for quicker data processing and insights.
Starter pools keep Spark clusters always on and ready to handle requests, so a Spark session typically initializes within 5 to 10 seconds with no manual setup. They use medium-sized nodes that scale dynamically based on the needs of the Spark job.
Default configurations and maximum node limits for starter pools, by Microsoft Fabric capacity SKU:

| SKU name | Capacity units | Spark VCores | Node size | Default max nodes | Max number of nodes |
|---|---|---|---|---|---|
| F2 | 2 | 4 | Medium | 1 | 1 |
| F4 | 4 | 8 | Medium | 1 | 1 |
| F8 | 8 | 16 | Medium | 2 | 2 |
| F16 | 16 | 32 | Medium | 3 | 4 |
| F32 | 32 | 64 | Medium | 8 | 8 |
| F64 | 64 | 128 | Medium | 10 | 16 |
| Trial capacity | 64 | 128 | Medium | 10 | 16 |
| F128 | 128 | 256 | Medium | 10 | 32 |
| F256 | 256 | 512 | Medium | 10 | 64 |
| F512 | 512 | 1024 | Medium | 10 | 128 |
| F1024 | 1024 | 2048 | Medium | 10 | 200 |
| F2048 | 2048 | 4096 | Medium | 10 | 200 |
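The numbers above follow a simple pattern: each capacity unit maps to 2 Spark VCores, a medium node consumes 8 of them, and the node count has a floor of 1 and a ceiling of 200. Here is a small Python sketch of that arithmetic; the constants are read off the table above, not an official formula.

```python
MEDIUM_NODE_VCORES = 8  # a medium node's vCores, per the node size table below

def max_starter_pool_nodes(capacity_units: int,
                           node_vcores: int = MEDIUM_NODE_VCORES) -> int:
    """Max node count implied by the table: 2 Spark VCores per capacity unit,
    at least 1 node, capped at 200 for the largest SKUs."""
    spark_vcores = capacity_units * 2
    return min(200, max(1, spark_vcores // node_vcores))

for sku, cu in [("F2", 2), ("F16", 16), ("F64", 64), ("F128", 128), ("F2048", 2048)]:
    print(sku, "->", max_starter_pool_nodes(cu))  # 1, 4, 16, 32, 200
```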
Billing for starter pools is based only on active Spark session usage. There is no charge for the time Spark keeps nodes ready for use.
For example, when a notebook job is submitted to a starter pool, billing applies only during the period when the notebook session is active.
Custom Spark pools allow for setting up Spark resources based on specific needs. They offer the flexibility to choose the number of nodes, their sizes, and other configurations. This ensures the pool is tailored to handle data processing tasks efficiently. Custom Spark pools can be created for a workspace and set as the default option for other users. This approach saves time by eliminating the need to set up a new Spark pool each time a notebook or Spark job is run. Custom Spark pools typically take about three minutes to start, as Spark must retrieve the nodes from Azure.
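For teams that prefer automation over the workspace UI, custom pools can also be created through the Fabric REST API. The sketch below is a hedged illustration only: the endpoint and payload shape follow the Fabric "create workspace custom pool" API as I understand it, the workspace ID and token are placeholders, and the official API reference should be checked before relying on it.

```python
import requests

# Hypothetical placeholders: supply your own workspace ID and Entra ID bearer token.
WORKSPACE_ID = "<workspace-guid>"
TOKEN = "<bearer-token>"

# Assumed endpoint and field names based on the Fabric REST API for custom
# Spark pools; verify against the official API reference before use.
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/spark/pools",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "EngineeringPool",   # hypothetical pool name
        "nodeFamily": "MemoryOptimized",
        "nodeSize": "Medium",        # see the node size table below
        "autoScale": {"enabled": True, "minNodeCount": 1, "maxNodeCount": 8},
        "dynamicExecutorAllocation": {"enabled": True, "minExecutors": 1,
                                      "maxExecutors": 7},
    },
)
resp.raise_for_status()
print(resp.json())
```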
Several aspects of a custom pool can be configured, such as the node size. The available node sizes are:
| Size | vCores | Memory |
|---|---|---|
| Small | 4 | 28 GB |
| Medium | 8 | 56 GB |
| Large | 16 | 112 GB |
| X-Large | 32 | 224 GB |
| XX-Large | 64 | 400 GB |
Custom pool configurations for F64 capacity:
| Fabric capacity SKU | Capacity units | Spark VCores | Node size | Max number of nodes |
|---|---|---|---|---|
| F64 | 64 | 128 | Small | 32 |
| F64 | 64 | 128 | Medium | 16 |
| F64 | 64 | 128 | Large | 8 |
| F64 | 64 | 128 | X-Large | 4 |
| F64 | 64 | 128 | XX-Large | 2 |
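The same arithmetic from the starter pool sketch reproduces this column: F64's 128 Spark VCores divided by the vCores of each node size.

```python
# Reuses max_starter_pool_nodes from the earlier sketch.
for size, vcores in [("Small", 4), ("Medium", 8), ("Large", 16),
                     ("X-Large", 32), ("XX-Large", 64)]:
    print(size, "->", max_starter_pool_nodes(64, node_vcores=vcores))
# Small -> 32, Medium -> 16, Large -> 8, X-Large -> 4, XX-Large -> 2
```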
Creating a custom pool is free, with charges only applied when Spark jobs run on the pool.
Spark Compute is a powerful and flexible way to use Spark on Microsoft Fabric. It enables a wide range of data engineering and data science tasks on data stored in OneLake or other sources. With options for creating and managing Spark pools, including starter pools for quick access and custom pools for specific configurations, it provides the versatility needed to meet diverse workloads and preferences.
Nayan Sagar N K