Apache Spark is currently one of the most popular big data technologies in the industry, backed by companies such as Databricks.
When working with Spark, one of a Data Engineer's primary responsibilities is to write highly optimised code so that Spark's distributed compute capabilities are used effectively.
This post introduces some of the most common performance issues encountered when using Spark (known as the 5 Ss) and how to address them.
The 5 Ss
Spill, Skew, Shuffle, Storage, and Serialisation are the five most common performance issues in Spark. Two general approaches can be used to improve Spark performance in almost any situation:
Reducing the amount of data that is ingested.
Reducing the amount of time Spark spends reading data (for example, by combining Predicate Pushdown with Disk Partitioning/Z-Order Clustering), as sketched below.
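As a rough illustration of the second point, the sketch below assumes a hypothetical events dataset with event_date and level columns and made-up paths; it shows how writing data partitioned by a column and then filtering on that column lets Spark prune whole directories and push the remaining filter down to the Parquet scan, so less data is read in the first place.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("predicate-pushdown-demo").getOrCreate()

# Hypothetical dataset: write it partitioned on disk by event_date so that
# future reads can skip entire directories (disk partitioning).
events = spark.read.json("/data/raw/events")            # assumed input path
(events.write
       .mode("overwrite")
       .partitionBy("event_date")                       # disk partitioning
       .parquet("/data/curated/events"))

# Filtering on the partition column prunes directories, and the filter on a
# regular column is pushed down to the Parquet reader (predicate pushdown).
daily_errors = (spark.read.parquet("/data/curated/events")
                     .filter("event_date = '2023-01-01'")   # partition pruning
                     .filter("level = 'ERROR'"))            # predicate pushdown

daily_errors.explain()   # the physical plan shows PartitionFilters / PushedFilters
```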
Spill
Spill occurs when a partition is too large to fit in RAM, forcing Spark to write temporary files to disk. To avoid Out of Memory (OOM) errors, an RDD is first moved from RAM to disk and later read back into RAM when needed. Disk reads and writes, however, are expensive and should be avoided as much as possible.
When performing Spark Jobs, Spill can be better understood by looking at the Spill (Memory) and Spill (Disk) values in the Spark UI.
Spill (Memory): the size of the spilled data as it existed in memory.
Spill (Disk): the size of the spilled data as written to disk.
Two techniques can reduce spill: provisioning a cluster with more memory per worker, or increasing the number of partitions (thereby making each partition smaller), as sketched below.
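A minimal sketch of the partition-side options, assuming a made-up input path; the exact partition count and memory settings depend entirely on the workload:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spill-demo").getOrCreate()
df = spark.read.parquet("/data/curated/events")   # assumed input

# Option 1: raise the number of shuffle partitions so each one is smaller.
spark.conf.set("spark.sql.shuffle.partitions", 800)   # default is 200

# Option 2: explicitly repartition a large DataFrame before a heavy stage.
df_small_parts = df.repartition(800)

# Option 3 (memory side): run the job on workers with more RAM, e.g. by
# passing --executor-memory 16g to spark-submit.
```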
Skew
When reading data, Spark typically creates evenly sized partitions of about 128 MB. As transformations are applied, some partitions may become significantly larger or smaller than the average.
Skew is this imbalance in partition sizes. A small amount of skew is usually harmless, but severe skew can cause Spill and OutOfMemory errors.
Two techniques are commonly used to reduce Skew, as sketched below:
Salting the skewed column with random integers to redistribute partition sizes.
Making use of Adaptive Query Execution (Spark 3+).
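A minimal sketch of both ideas, assuming two hypothetical DataFrames (orders, skewed on customer_id, and a smaller customers table) and made-up paths; the number of salts is arbitrary:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

# Option 1: let Adaptive Query Execution (Spark 3+) split skewed partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

orders = spark.read.parquet("/data/orders")        # large table, skewed on customer_id (assumed)
customers = spark.read.parquet("/data/customers")  # smaller dimension table (assumed)

# Option 2: manual salting. Append a random suffix (0..NUM_SALTS-1) to the
# skewed key on the large side, and replicate the other side once per suffix.
NUM_SALTS = 16
salt = (F.rand() * NUM_SALTS).cast("int").cast("string")
salted_orders = orders.withColumn(
    "salted_key", F.concat_ws("_", F.col("customer_id").cast("string"), salt))

salted_customers = (customers
    .withColumn("salt", F.explode(F.array(*[F.lit(str(i)) for i in range(NUM_SALTS)])))
    .withColumn("salted_key",
                F.concat_ws("_", F.col("customer_id").cast("string"), F.col("salt"))))

# Joining on the salted key spreads the hot customer_id values across
# NUM_SALTS partitions instead of concentrating them in one.
joined = salted_orders.join(salted_customers, on="salted_key", how="inner")
```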
Shuffle
Shuffle occurs when data is moved between executors during wide transformations (e.g., joins, groupBy, etc.) or certain actions such as count. Handled incorrectly, shuffles can also introduce Skew.
Some ways to reduce the impact of shuffling are as follows:
Creating fewer and larger workers (thus lowering network IO overheads).
Pre-filtering data before shuffling to reduce its size.
Denormalising the datasets in question.
Using Solid State Drives instead of Hard Disk Drives for faster execution.
Using a Broadcast Hash Join (broadcasting the smaller table) when joining a large table with a small one; for joins between two large tables, using SortMergeJoin instead, since broadcasting a large table can cause Out Of Memory issues (see the sketch after this list).
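A minimal sketch of the broadcast case, with made-up table names and paths (a large transactions fact table joined to a small countries lookup table on a hypothetical country_code column):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

transactions = spark.read.parquet("/data/transactions")   # large fact table (assumed)
countries = spark.read.parquet("/data/countries")         # small lookup table (assumed)

# Broadcasting ships the small table to every executor once, so the large
# table never has to be shuffled across the network for the join.
enriched = transactions.join(broadcast(countries), on="country_code", how="left")

enriched.explain()   # the physical plan should show BroadcastHashJoin
```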
Storage
Storage issues arise when data is stored on disk in an inefficient way, and they can in turn cause unnecessary Shuffle. Tiny files, directory scanning, and schemas are three of the most common storage issues.
Tiny Files: partition files much smaller than the default block size of 128 MB; opening and reading many small files adds significant overhead.
Directory scanning: we may have a very large number of files in a single directory or, in the case of heavily partitioned datasets, many levels of folders. Registering the data as a table reduces the amount of scanning required.
Schema: schema issues vary with the file format used. With JSON and CSV, for example, the entire dataset must be read to infer data types.
With Parquet, reading a single file is enough, but the full list of Parquet files must still be read if we want to handle possible schema changes over time. Providing the schema definition up front can therefore improve performance, as sketched below.
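A rough sketch of supplying the schema explicitly (the field names and input path are made up); with the schema declared, Spark skips the inference pass over the data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Declaring the schema up front means Spark never has to scan the data
# just to figure out column names and types.
schema = StructType([
    StructField("order_id",   StringType(),    nullable=False),
    StructField("amount",     DoubleType(),    nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])

orders = (spark.read
               .schema(schema)              # skip schema inference
               .json("/data/raw/orders"))   # assumed input path
```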
Serialization
Serialisation covers all of the issues involved in distributing code across a cluster (code is serialised, sent to the executors, and then deserialised).
In the case of Python, this process is even more expensive because the code must be pickled and a Python interpreter instance must run on each executor.
Serialisation problems can also arise when integrating codebases with legacy systems (e.g., Hadoop), third-party libraries, and bespoke frameworks.
One key strategy for reducing serialisation difficulties is avoiding UDFs and Vectorized UDFs wherever possible (they act as a black box for the Catalyst Optimizer) and preferring Spark's built-in functions instead.
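A minimal before/after sketch (the DataFrame and column names are illustrative): a Python UDF forces rows through pickling and a Python worker, while the equivalent built-in function runs inside the JVM and stays visible to the Catalyst Optimizer.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: rows are pickled, sent to a Python worker, processed, and sent
# back -- opaque to the Catalyst Optimizer.
upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
slow = df.withColumn("name_upper", upper_udf("name"))

# Built-in function: runs inside the JVM, no serialisation round trip, and
# the optimizer can reason about it.
fast = df.withColumn("name_upper", F.upper("name"))
```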