Advanced Spark Concepts for Job Interviews: Part 1
This blog will cover some of the advanced concepts in Spark such as resource management, shuffling, catalyst optimizer, etc which will prepare you for your job interview.
The topic will be divided into two parts and expects you to have a basic knowledge of Spark.
Check out my other blogs to get yourself started with your Spark journey.
Let’s revise some concepts of Spark
- Spark is a distributed computing platform
- Spark Application is a distributed application
- Spark Application needs a cluster, e.g., Hadoop YARN and Kubernetes
What is a cluster?
A pool of computers working together but viewed as a single system, e.g., I have ten worker nodes, each with 16 CPU cores and 64 GB RAM, So, my total CPU capacity is 160 CPU cores, and the RAM capacity is 640 GB.
How is a Spark Application run on the cluster?
I will use the Spark Submit command and submit my Spark Application to the cluster.
My request will go to the YARN resource manager. The YARN RM will create one Application Master Container on a worker node and start my application’s main() method in the container.
What is a container?
A container is an isolated virtual runtime environment. It comes with some CPU and memory allocation. For example, let’s assume YARN RM gave 4 CPU Cores and 16 GB memory to this container and started it on a worker node.
Now my application’s main() method will run in the container, and it can use 4 CPU cores and 16 GB of memory.
The container is running main() method on my application, and we have two possibilities here.
- PySpark Application
- Scala Application
So, let’s assume our application is a PySpark application. But Spark is written in Scala, and it runs in the Java virtual machine.
Spark developers wanted to bring this to Python developers, so they created a Java wrapper on top of Scala code, then created Python wrapper on top of the Java wrappers, and the Python wrapper is known as PySpark.
If you submitted the PySpark code, you would have a PySpark driver, and you will also have a JVM driver. These two will communicate using Py4J.
The PySpark main method is the PySpark Driver, and the JVM application is Application Driver.
The Application driver distributes the work to others. So, the driver does not perform any data processing work. Instead, it will create some executors and get the work-done from them.
But how does Application Driver work?
After starting, the driver will go-back to the YARN RM and ask for some more containers. The RM will create some more containers on worker nodes and give them to the driver.
Now the driver will start spark executor in these containers. Each container will run one Spark executor, and the Spark executor is a JVM application.
So, your driver is a JVM application, and your executor is also a JVM application. These executors are responsible for doing all the data processing work. The driver will assign work to the executors, monitor them, and manage the overall application, but the executors do all the data processing.
PySpark code translated into Java code, and runs in the JVM. But if you are using some Python libraries which doesn’t have a Java wrapper, you will need a Python runtime environment to run them. So, the executors will create a Python runtime environment so they can execute your Python code.
Spark Submit and its Options
What is Spark Submit?
The Spark Submit is a command-line tool that allows you to submit the Spark application to the cluster.
spark-submit --class<main-class> --master<master-url> --deploy-mode<deploy-mode> <application-jar>[application-args]
What are the types of Deploy Modes?
The Spark Submit allows you to submit the spark application to the cluster, and you can apply to run in one of the two modes.
- Cluster Mode
- Client Mode
In the cluster mode, Spark Submit will reach the YARN RM, requesting it to start the driver in an AM container. YARN will start your driver in the AM container on a worker node in the cluster. Then the driver will again request YARN to begin executor containers. So, the YARN will start executor containers and hand them over to the driver. So, in cluster mode, your driver is running in the AM container on a worker node in cluster. Your executors are also running in the executor containers on some worker nodes in cluster.
In the Client Mode, Spark Submit doesn’t go to the YARN resource manager for starting an AM container. Instead, the spark-submit command will start the driver JVM directly on the client machine. So, in this case, the spark driver is a JVM application running on your client machine. Now the driver will reach out to the YARN resource manager requesting executor containers. The YARN RM will start executor containers and hand them over to the driver. The driver will start executors in those containers to do the job.
Client Machines in the Spark Cluster are also known as Gateway Nodes.
How do we choose the deploy mode?
You will almost always submit your application in cluster mode. It is unlikely that you submit your spark application in client mode. We have two clear advantages of running your application in cluster mode.
- The cluster mode allows you to submit the application and log off from the client machine, as the driver and executors run on the cluster. They have nothing active on your client’s machine. So, even if you log off from your client machine, the driver and executor will continue to run in the cluster.
- Your application runs faster in cluster mode because your driver is closer to the executors. The driver and executor communicate heavily, and you don’t get impacted by network latency, if they are close.
The designed client mode is for interactive workloads. For example, Spark Shell runs your code in client mode. Similarly, Spark notebooks also use the client mode.
Spark Jobs: Stage, Shuffle, Tasks, Slots
- Spark creates one job for each action.
- This job may contain a series of multiple transformations.
- The Spark engine will optimize those transformations and creates a logical plan for the job.
- Then spark will break the logical plan at the end of every wide dependency and create two or more stages.
- If you do not have a wide dependency, your plan will be a single-stage plan.
- But if you have N wide-dependencies, your plan should have N+1 stages.
- Data from one stage to another stage is shared using the shuffle/sort operation.
- Now each stage may be executed as one or more parallel tasks.
- The number of tasks in the stage is equal to the number of input partitions.
The task is the most critical concept for a Spark job and is the smallest unit of work in a Spark job. The Spark driver assigns these tasks to the executors and asks them to do the work.
The executor needs the following things to perform the task.
- The task Code
- Data Partition
So, the driver is responsible for assigning a task to the executor. The executor will ask for the code or API to be executed for the task. It will also ask for the data frame partition on which to execute the given code. The application driver facilitates both these things for the executor, and the executor performs the task.
Now, let’s assume I have a driver and four executors. Each executor will have one JVM process. But I assigned 4 CPU cores to each executor. So, my Executor JVM can create four parallel threads and that’s the slot capacity of my executor.
So, each executor can have four parallel threads, and we call them executor slots. The driver knows how many slots we have at each executor and it is going to assign tasks to fit in the executor slots.
The last stage will send the result back to the driver over the network. The driver will collect data from all the tasks and present it to you.
Spark SQL Engine and Query Planning
Apache Spark gives you two prominent interfaces to work with data.
- Spark SQL
- Dataframe API
You may have Dataframe APIs, or you may have SQL both will go to the Spark SQL engine.
For Spark, they are nothing but a Spark Job represented as a logical plan.
The Spark SQL Engine will process your logical plan in four stages.
- The Analysis stage will parse your code for errors and incorrect names. The Analysis phase will parse your code and create a fully resolved logical plan. Your code is valid if it passes the Analysis phase.
- The Logical Optimization phase applies standard rule-based optimizations to the logic plan.
- Spark SQL takes a logical plan and generates one or more in the Physical Planning phase. Physical planning phase considers cost based optimization. So the engine will create multiple plans to calculate each plan’s cost and select the one with the low cost. At this stage the engine use different join algorithms to generate more than one physical plan.
- The last stage is Code Generation. So, your best physical plan goes into code generation, the engine will generate Java byte code for the RDD operations, and that’s why Spark is also said to act as a compiler as it uses state of the art compiler technology for code generation to accelerate execution.
END OF PART 1