Advanced Spark Concepts for Job Interviews: Part 2
In this part, you’ll learn about Spark memory allocation and memory management.
I recommend reading Advanced Spark Concepts for Job Interview: Part 1 before getting started with this article.
Spark Memory Allocation
Assume you submitted a Spark application to a YARN cluster. The YARN RM will allocate an application master (AM) container and start the driver JVM inside it. The driver will start with the memory allocation you requested.
You can ask for the driver’s memory using two configurations.
- spark.driver.memory
- spark.driver.memoryOverhead
So, let’s assume you asked for spark.driver.memory as 1 GB and left spark.driver.memoryOverhead at its default value of 0.10.
The YARN RM will allocate 1 GB of memory for the driver JVM, plus the higher of 10% of the requested memory or 384 MB as container overhead.
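You can reproduce that arithmetic yourself. Here is a minimal Python sketch of the rule, assuming the standard YARN-mode defaults (a 384 MB floor and a 10% overhead factor); the requested sizes are only illustrative.

```python
# Sketch of how YARN sizes the driver container (illustrative values).
MIN_OVERHEAD_MB = 384        # YARN-mode floor for memoryOverhead
OVERHEAD_FACTOR = 0.10       # default overhead factor

def driver_container_mb(driver_memory_mb):
    """Container = spark.driver.memory + max(10% of it, 384 MB)."""
    overhead = max(int(driver_memory_mb * OVERHEAD_FACTOR), MIN_OVERHEAD_MB)
    return driver_memory_mb + overhead

print(driver_container_mb(1024))   # 1 GB heap -> 1024 + 384 = 1408 MB
print(driver_container_mb(8192))   # 8 GB heap -> 8192 + 819 = 9011 MB
```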
The overhead memory is used by the container process and any other non-JVM processes within the container. Your Spark driver uses the entire JVM heap but nothing from the overhead.
The driver will then request executor containers from YARN, and the YARN RM will allocate a bunch of executor containers.
The total memory allocated to the executor container is the sum of the following:
- Overhead Memory
- Heap Memory
- Off Heap Memory
- PySpark Memory
So, a Spark driver will ask for executor container memory using four configurations.
What are the configurations used for executor container memory?
- Overhead memory comes from spark.executor.memoryOverhead.
- JVM heap comes from spark.executor.memory.
- Off-heap memory comes from spark.memory.offHeap.size.
- PySpark memory comes from spark.executor.pyspark.memory.
So, the driver will look at all these configurations to calculate your memory requirement and sum it up.
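As a rough sketch of that sum, here is a small Python illustration with made-up sizes; Spark’s own rounding may differ slightly.

```python
# Sketch of the executor container request (illustrative sizes).
MIN_OVERHEAD_MB = 384

def executor_container_mb(heap_mb, overhead_mb=None, offheap_mb=0, pyspark_mb=0):
    """Container = heap (spark.executor.memory)
                 + overhead (spark.executor.memoryOverhead)
                 + off-heap (spark.memory.offHeap.size)
                 + PySpark (spark.executor.pyspark.memory)."""
    if overhead_mb is None:  # default: max(10% of heap, 384 MB)
        overhead_mb = max(int(heap_mb * 0.10), MIN_OVERHEAD_MB)
    return heap_mb + overhead_mb + offheap_mb + pyspark_mb

# 8 GB heap, default overhead, 1 GB off-heap, 500 MB of PySpark memory
print(executor_container_mb(8192, offheap_mb=1024, pyspark_mb=500))   # 10535 MB
```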
The container must run on a worker node in the YARN cluster. What if the worker node is a 6 GB machine? YARN cannot allocate an 8 GB container on a 6 GB machine because there is not enough physical memory. So, before you ask for driver or executor memory, check with your cluster admin for the maximum allowed values.
While using the YARN RM, you should look at the following configurations (a quick sanity check is sketched after this list).
- yarn.scheduler.maximum-allocation-mb
- yarn.nodemanager.resource.memory-mb
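Here is the kind of sanity check that list implies. It is only a sketch; the limit value below is hypothetical, so read the real one from your cluster’s yarn-site.xml or ask your admin.

```python
# Sketch: will the requested container fit under the YARN limit?
yarn_max_allocation_mb = 6144        # hypothetical yarn.scheduler.maximum-allocation-mb
requested_container_mb = 8192 + 819  # heap + overhead from the earlier sketch

if requested_container_mb > yarn_max_allocation_mb:
    print(f"Request of {requested_container_mb} MB exceeds the "
          f"{yarn_max_allocation_mb} MB YARN maximum; lower your memory settings.")
```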
You do not need to worry about PySpark memory if you write your Spark application in Java or Scala. But if you are using PySpark, this question becomes critical.
PySpark is not a JVM process, so it cannot use the JVM heap; it runs out of the overhead memory, some of which is constantly consumed by the container and other internal processes. If your PySpark workers need more than the overhead can accommodate, you will see an OOM error.
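If you hit that OOM in PySpark, one option (assuming Spark 2.4 or later, where spark.executor.pyspark.memory is available) is to give the Python workers an explicit budget or to raise the overhead. The sketch below is only illustrative; the app name and sizes are hypothetical and should be tuned to your workload.

```python
from pyspark.sql import SparkSession

# Sketch: give the Python workers their own budget instead of letting them
# compete for space inside the overhead. Sizes are illustrative only.
spark = (
    SparkSession.builder
    .appName("pyspark-memory-demo")                     # hypothetical app name
    .config("spark.executor.memory", "4g")              # JVM heap per executor
    .config("spark.executor.memoryOverhead", "1g")      # container overhead
    .config("spark.executor.pyspark.memory", "1g")      # limit for Python workers
    .getOrCreate()
)
```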
Let’s have a quick recap.
1. You have a container, and the container has some memory.
2. This total memory is broken into two parts: heap memory (driver/executor memory) and overhead memory (OS memory).
3. The heap memory goes to your JVM.
4. We call it driver memory when the container runs a driver. Similarly, we call it executor memory when the container runs an executor.
The overhead memory is used for a bunch of things, such as network buffers. So you will be using overhead memory for your shuffle exchange, for reading partition data from remote storage, and so on.
Both memory portions are critical for your Spark application, and more often than not, a lack of overhead memory will cost you an OOM exception, because the overhead is easily overlooked even though it backs the shuffle exchange and network read buffers.
Spark Memory Management
Let’s focus on the JVM memory in this part. The heap memory is further broken down into three parts.
- Reserved Memory
- Spark Memory Pool
- User Memory
So, let’s assume I got 8 GB for the JVM heap. This 8 GB is divided into three parts. Spark will reserve 300 MB for itself. That’s fixed, and the Spark engine itself uses it.
The next part is the Spark executor memory pool, which is controlled by the spark.memory.fraction configuration, and the default value is 0.6 (60%). In this example, the Spark memory pool translates to 4620 MB.
How do Spark Memory Pools work?
We have 8 GB, or roughly 8000 MB. Three hundred MB goes to reserved memory, which leaves us with 7700 MB. Now take 60% of this, and you get 4620 MB for the Spark memory pool. What is left? The remaining 3080 MB goes to user memory.
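The same arithmetic as a tiny sketch, using the article’s round 8000 MB figure and the default spark.memory.fraction of 0.6.

```python
# Sketch of the JVM heap split (round numbers, as in the example above).
heap_mb = 8000
reserved_mb = 300                          # fixed reservation for the Spark engine
usable_mb = heap_mb - reserved_mb          # 7700 MB
spark_pool_mb = usable_mb * 0.6            # spark.memory.fraction = 0.6 -> 4620 MB
user_pool_mb = usable_mb - spark_pool_mb   # the remainder            -> 3080 MB

print(reserved_mb, spark_pool_mb, user_pool_mb)   # 300 4620.0 3080.0
```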
Now let’s try to understand the three memory pools.
1. The Reserved Pool goes to the Spark engine itself. You cannot use it.
2. The Spark Memory Pool is your main executor memory pool, which you will use for DataFrame operations and caching.
The Spark memory pool is where all your DataFrames and DataFrame operations live. You can increase it from 60% to 70% or even more if you are not using UDFs, custom data structures, or RDD operations. But you cannot push it too far, because the remaining user memory is still needed for metadata and other internal things.
The Spark memory pool is further broken down into two sub-pools.
- Storage Memory
- Execution Memory
The default split is 50% for each sub-pool, but you can change it using the spark.memory.storageFraction configuration.
We use the storage pool for caching DataFrames, and the execution pool for performing DataFrame computations.
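As a quick illustration of these knobs, here is a sketch that sets both fractions to their defaults and caches a small DataFrame; the app name is hypothetical and the values are not recommendations.

```python
from pyspark.sql import SparkSession

# Sketch: the unified pool split, shown with its default values.
spark = (
    SparkSession.builder
    .appName("memory-pools-demo")                    # hypothetical app name
    .config("spark.memory.fraction", "0.6")          # unified pool share of (heap - 300 MB)
    .config("spark.memory.storageFraction", "0.5")   # half of it protected for cached blocks
    .getOrCreate()
)

df = spark.range(1_000_000)
df.cache().count()   # cached blocks live on the storage side of the unified pool
```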
3. The User Memory Pool is used for non-DataFrame operations.
Here are some examples of what goes into the User Memory Pool:
- If you create user-defined data structures, such as hash maps, Spark will use the User Memory Pool.
- Similarly, Spark’s internal metadata and user-defined functions are stored in user memory.
- All the RDD information is kept, and direct RDD operations are performed, in user memory.
But DataFrame operations do not use user memory, even though a DataFrame is internally translated and compiled into RDDs. You will use user memory only if you apply RDD operations directly in your code.
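Here is a rough illustration of that boundary. It is only a sketch; note that in PySpark the lambda on the RDD path also runs in the Python workers, outside the JVM entirely.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-memory-demo").getOrCreate()  # hypothetical app name

df = spark.range(1_000_000)

# DataFrame path: planned by Catalyst and executed inside the Spark memory pool,
# even though it compiles down to RDDs internally.
df.groupBy((df.id % 10).alias("bucket")).count().show()

# Explicit RDD path: once you drop to the RDD API, the records and any structures
# your functions build fall outside the Spark (DataFrame) memory pool.
counts = df.rdd.map(lambda row: (row.id % 10, 1)).countByKey()
print(dict(counts))
```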
END OF PART 2