Apache Spark is an open-source parallel processing framework for large-scale data processing and analytics. It has become popular in big data scenarios and is available in multiple platform implementations, including Microsoft Fabric. Spark enables large-scale data analytics by coordinating work across multiple processing nodes in a cluster, known in Microsoft Fabric as a Spark pool. Put more simply, Spark uses a “divide and conquer” approach to processing large volumes of data quickly by distributing the work across multiple computers, and the process of distributing tasks and combining results is handled for us by Spark.
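To make that divide-and-conquer idea concrete, here is a minimal PySpark sketch of a word count: the data is split into partitions, the words are counted in parallel on the worker nodes, and the partial counts are merged into a final result. It assumes a Fabric notebook where a SparkSession named spark already exists; the file path is hypothetical.

```python
# A minimal word-count sketch: partitions are processed in parallel on the
# worker nodes and the partial counts are merged into a final result.
# Assumes a Fabric notebook where a `spark` session already exists;
# the file path is hypothetical.
words = (
    spark.read.text("Files/sample.txt")            # hypothetical lakehouse file
         .rdd
         .flatMap(lambda row: row.value.split())   # split each line into words
)

counts = (
    words.map(lambda word: (word, 1))              # tag each word with a count of 1
         .reduceByKey(lambda a, b: a + b)          # merge partial counts across nodes
)

print(counts.take(10))
```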
The use of Spark in Microsoft Fabric is fully managed and abstracted, which means the complexity of spinning up a Spark instance is handled behind the scenes. We can still control some configurations and settings, but most of the hard work is done by Fabric. Spark compute is used specifically for the Data Engineering and Data Science experiences in Fabric, and comes into play whenever we run a notebook or a Spark job definition.
Spark can run code written in a wide range of languages, including Java, Scala, R, SQL (Spark SQL), and Python (PySpark). In practice, most data engineering and analytics workloads are accomplished using a combination of PySpark and Spark SQL.
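For example, a single notebook cell often mixes the two: data is loaded and shaped with PySpark, then queried with Spark SQL. The sketch below assumes a Fabric notebook with a ready-made spark session; the table path and column names are hypothetical.

```python
# Minimal sketch of combining PySpark and Spark SQL in one notebook.
# Assumes an existing `spark` session; the table path and columns are hypothetical.
df = spark.read.format("delta").load("Tables/sales")   # load a Delta table as a DataFrame

df.createOrReplaceTempView("sales")                     # make it queryable from Spark SQL

top_products = spark.sql("""
    SELECT Product, SUM(Quantity) AS TotalQuantity
    FROM sales
    GROUP BY Product
    ORDER BY TotalQuantity DESC
    LIMIT 10
""")

top_products.show()
```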
Spark pools
A Spark pool consists of compute nodes that distribute data processing tasks. The general architecture is shown in the following diagram.
A Spark pool contains two kinds of node:
- A head node, which coordinates the distributed work through a driver program.
- Worker nodes, on which executor processes perform the actual data processing tasks in parallel.
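To see this division of labor from a notebook, the small sketch below checks how much parallelism the pool offers and how a DataFrame is split into partitions across the executors (assuming an existing spark session; the numbers depend on the pool's node size and autoscale settings).

```python
# Minimal sketch: inspect how work is spread across the pool's executors.
# Assumes a Fabric notebook with an existing `spark` session.
print(spark.sparkContext.defaultParallelism)   # total cores available across executors

df = spark.range(0, 1_000_000)                 # a simple test DataFrame
print(df.rdd.getNumPartitions())               # partitions the data is split into

df = df.repartition(8)                         # explicitly redistribute into 8 partitions
print(df.rdd.getNumPartitions())
```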
Spark pools in Microsoft Fabric
Microsoft Fabric provides a starter pool in each workspace, enabling Spark jobs to be started and run quickly with minimal setup and configuration. The starter pool can be configured for a specific workload, and custom Spark pools can also be created with node configurations that suit particular data processing needs.
The starter pool can be managed in the Data Engineering/Science section of the workspace settings.
Specific configuration settings for Spark pools include:
- Node family: the type of virtual machine used for the cluster nodes (memory-optimized nodes usually give the best performance).
- Autoscale: whether nodes are provisioned automatically as needed and, if so, the initial and maximum number of nodes in the pool.
- Dynamic allocation: whether executor processes are allocated dynamically on the worker nodes based on data volumes.
If one or more custom Spark pools are created in a workspace, one of them (or the starter pool) can be set as the default pool, which is used whenever a specific pool isn't specified for a given Spark job.
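Pool settings are managed through the workspace UI, but it can be useful to check which compute settings the current session actually received. The minimal sketch below reads a few standard Spark configuration properties (assuming an existing spark session; keys the pool hasn't set fall back to the default shown).

```python
# Minimal sketch: inspect the compute settings the current session received.
# Assumes an existing `spark` session; unset keys fall back to "not set".
for key in [
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.maxExecutors",
]:
    print(key, "=", spark.conf.get(key, "not set"))
```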
Runtimes and Environments
In the Data Engineering experience, the Spark runtime version determines which versions of Apache Spark, Delta Lake, Python, and other core components are used. A variety of libraries can be added on top of the runtime for both general and specialized tasks. Organizations often create multiple environments to handle different data processing needs; each environment includes a specific runtime version and the necessary libraries, and data engineers and scientists can then choose the right environment for their tasks on a Spark pool.
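A quick way to confirm which runtime components a notebook is actually running on is to print the versions from the session, as in this small sketch (assuming the usual spark session is available).

```python
# Minimal sketch: print the core component versions of the current runtime.
# Assumes a Fabric notebook with an existing `spark` session.
import sys

print("Spark version: ", spark.version)            # Apache Spark version of the runtime
print("Python version:", sys.version.split()[0])   # Python version shipped with the runtime
```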
Environments in Microsoft Fabric
Custom environments can be created in a Fabric workspace, enabling us to use specific Spark runtimes, libraries, and configuration settings for different data processing operations.
When creating an environment, we can:
- Specify which Spark runtime version to use.
- Install public libraries (for example, from PyPI) and upload custom libraries.
- Specify which Spark pool the environment uses.
- Set Spark configuration properties to override the default behavior.
- Upload resource files that should be available in the environment.
After creating at least one custom environment, we can specify it as the default environment in the workspace settings.
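Libraries don't always have to be baked into an environment. For ad-hoc needs, a notebook can install a package in-line so that it's available only for the current session, as in this minimal sketch (the package chosen here is just an example).

```python
# Minimal sketch: install a library for the current session only, as a
# lightweight alternative to adding it to an environment.
# The package is just an example; in-line installs affect only this session.
%pip install faker

from faker import Faker

fake = Faker()
print(fake.name())   # generate a sample value to confirm the install worked
```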
High concurrency mode
High concurrency mode enables users to share Spark sessions across data engineering and data science workloads. When this mode is used, each notebook or job operates within its own isolated environment even though it shares the same Spark session, which means multiple items can run concurrently without interfering with each other.
How does it work?
When a notebook is run in high concurrency mode, it can attach to an existing shared Spark session that meets the sharing conditions; if none is available, a new session is started. Within a shared session, each notebook gets its own isolated execution context, so variables and state defined in one notebook don't affect the others.
Session sharing conditions include:
- The sessions belong to the same user.
- The sessions have the same default lakehouse configuration.
- The sessions use the same Spark compute configuration.
- The sessions use the same library packages.
Conclusion
Apache Spark, known for its powerful parallel processing capabilities, provides scalable data processing within Microsoft Fabric. By leveraging Spark pools, users can efficiently distribute tasks across multiple compute nodes, while the flexibility of multiple runtime versions and custom environments helps address varied data processing needs. High concurrency mode further enhances the efficiency and cost-effectiveness of Spark operations, allowing multiple concurrent workloads to run smoothly and securely within shared Spark sessions. Whether using starter or custom Spark pools, and leveraging specific environments, Microsoft Fabric provides a robust platform for tackling complex data challenges.
Nayan Sagar N K