Single Node Cluster Setup with Spark 2.x/3.x
In this tutorial we will go through how to set up a single-node cluster on an Ubuntu machine in the cloud.
We will use a Google Cloud virtual machine running Ubuntu 18.04 to set up our single-node cluster; you may use AWS or Azure instead.
We will also set up a local Hive metastore.
Go to Google Cloud Platform, register and start your free trial. GCP provides $300 in free credits.
Once the signup is successful you will be redirected to the GCP console.
Click on Compute Engine in the left pane and start a VM instance with Ubuntu 18.04 as the OS.
Now follow the below steps once you have started your VM instance.
Setup Python and Java
### Update existing list of packages
sudo apt update
### Install pip
sudo apt install python3-pip
### Install venv
sudo apt install python3-venv
### Validate venv
python3 -m venv tutorial-env
ls -ltr
rm -rf tutorial-env
### Install Java JDK
sudo apt-get install openjdk-8-jdk
### Validate Java
java -version
javac -version
Setup Passwordless SSH to localhost
### Check if SSH is installed
ssh
### Generate the private and public keys using ssh-keygen
ssh-keygen
### Copy the content of public key to authorized key file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
### Test ssh localhost
ssh localhost
### Exit the SSH session
exit
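Optionally, we can confirm that the passwordless SSH we just set up works non-interactively. The small Python sketch below is my own helper (not part of the standard setup); it simply wraps the ssh command with BatchMode so it fails instead of prompting for a password.
# check_ssh.py - verify passwordless SSH to localhost; BatchMode=yes fails instead of prompting
import subprocess

result = subprocess.run(
    ["ssh", "-o", "BatchMode=yes", "localhost", "echo", "ssh-ok"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
if result.returncode == 0 and "ssh-ok" in result.stdout:
    print("Passwordless SSH to localhost is working.")
else:
    print("SSH check failed:", result.stderr.strip())
Save it as check_ssh.py and run it with python3 check_ssh.py.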
Setup Hadoop, HDFS, YARN and manage the cluster services
### Download the Hadoop tar (latest as of this writing).
wget https://mirrors.gigenet.com/apache/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
### Untar the File.
tar xfz hadoop-3.3.1.tar.gz
### Archive all the downloaded software.
mkdir softwares
mv hadoop-3.3.1.tar.gz softwares
### Set up folder structure.
sudo mv -f hadoop-3.3.1 /opt
### Change the ownership to the current user.
sudo chown ${USER}:${USER} -R /opt/hadoop-3.3.1
### Create a soft link as /opt/hadoop.
sudo ln -s /opt/hadoop-3.3.1 /opt/hadoop
### Add the below three lines to the .profile file.
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
### Source the .profile or exit the Session.
source .profile
### Validate if the changes are reflected.
echo $JAVA_HOME
echo $HADOOP_HOME
echo $PATH
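If you prefer a scripted check, the optional Python sketch below (my own helper, nothing Hadoop-specific) verifies that the variables from .profile are visible to newly started processes:
# check_env.py - confirm the Java/Hadoop environment variables are set for new processes
import os

for var in ("JAVA_HOME", "HADOOP_HOME", "PATH"):
    value = os.environ.get(var, "")
    print(var, "=", value)
    if not value:
        print("WARNING:", var, "is not set; re-check your .profile")
Run it with python3 check_env.py after sourcing .profile.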
### core-site.xml : Informs Hadoop where NameNode runs in the cluster.
/opt/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
### hdfs-site.xml : Contains the configuration settings for HDFS daemons like the NameNode, the Secondary NameNode, and the DataNodes.
/opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/opt/hadoop/dfs/namesecondary</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
</configuration>
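As a quick sanity check of the two site files we just edited, the optional Python sketch below (it only parses the XML written above, nothing else is assumed) prints every property Hadoop will pick up:
# check_conf.py - print the properties configured in core-site.xml and hdfs-site.xml
import xml.etree.ElementTree as ET

for conf in ("/opt/hadoop/etc/hadoop/core-site.xml",
             "/opt/hadoop/etc/hadoop/hdfs-site.xml"):
    print("---", conf, "---")
    root = ET.parse(conf).getroot()
    for prop in root.findall("property"):
        print(prop.findtext("name"), "=", prop.findtext("value"))
You should see fs.defaultFS, the dfs.* directories, the replication factor and the block size listed.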
### hadoop-env.sh : Contains environment variable settings used by Hadoop.
/opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
### Format Namenode
hdfs namenode -format
ls -ltr /opt/hadoop/dfs/
### Start HDFS components.
start-dfs.sh
### Validate that the HDFS services are running.
jps
### Test HDFS Commands.
hadoop fs -ls /
hadoop fs -mkdir -p /user/${USER}
hadoop fs -ls /user/${USER}
echo "hello hdfs" > test.txt
hadoop fs -copyFromLocal test.txt /user/${USER}
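The same round trip can also be done from Python. This is an optional sketch that just shells out to the hadoop CLI used above (no extra libraries assumed):
# hdfs_roundtrip.py - read back the test.txt we just copied into HDFS
import getpass
import subprocess

user = getpass.getuser()
content = subprocess.check_output(["hadoop", "fs", "-cat", "/user/{}/test.txt".format(user)])
print("HDFS says:", content.decode().strip())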
### yarn-site.xml : This file contains the YARN configuration options.
/opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,JAVA_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
### mapred-site.xml : MapReduce configuration options are stored in this file.
/opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
### Start YARN Components.
start-yarn.sh
### Validate YARN Services are running.
jps
### All these start and stop scripts are available in:
/opt/hadoop/sbin
### Follow this order to stop the services.
1. stop-yarn.sh
2. stop-dfs.sh
3. Stop the VM instance
### Follow this order to start the services (a small status-check sketch follows below).
1. Start the VM instance
2. start-dfs.sh
3. start-yarn.sh
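As mentioned above, here is a small status-check sketch (my own helper built around jps, not part of Hadoop itself) that reports which daemons are up after a restart:
# check_daemons.py - report which HDFS/YARN daemons appear in the jps output
import subprocess

EXPECTED = ["NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"]

jps_output = subprocess.check_output(["jps"]).decode()
running = {parts[1] for parts in (line.split() for line in jps_output.splitlines()) if len(parts) > 1}

for daemon in EXPECTED:
    print(daemon, "is", "running" if daemon in running else "NOT running")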
Setup Docker, PostgreSQL, Hive and the Metastore
#####################
## Docker Set up ##
#####################
### Update the existing list of packages.
sudo apt update
### Install a few prerequisite packages.
sudo apt install apt-transport-https ca-certificates curl software-properties-common
### Add the GPG key for the official Docker repository to our host.
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
### Add the Docker repository to APT sources.
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
### Update the package database with the Docker packages from the newly added repo.
sudo apt update
### Check the docker-ce installation candidate.
apt-cache policy docker-ce
### Install Docker.
sudo apt install docker-ce
### Validate the Docker installation.
sudo systemctl status docker
### To run docker commands without sudo, add the user to the docker group.
sudo usermod -aG docker ${USER}
### Log out and log in again, then validate that the user was added to the docker group.
id ${USER}
docker images
### To check whether you can access and download images from Docker Hub, type:
docker run hello-world
########################
## PostgreSQL Set up ##
########################
### Now, let's set up a Postgres image in a Docker container.
### Create the container from the Postgres image.
docker create \
--name postgress_container \
-p 6432:5432 \
-e POSTGRES_PASSWORD=shorya \
postgres
### Start the Container.
docker start postgress_container
### Check if the container is running (hit Ctrl+C to come out of the logs).
docker logs -f postgress_container
### Validate that we are able to run psql from the container.
docker exec \
-it postgress_container \
psql -U postgres
### Create a database "metastore" for Hive in Postgres.
CREATE DATABASE metastore;
CREATE USER hive WITH ENCRYPTED PASSWORD 'shorya';
GRANT ALL ON DATABASE metastore TO hive;
\l to list the databases
\q to exit psql
### If you want to access Postgres from the host, install a Postgres client.
sudo apt install postgresql-client -y
psql -h localhost \
-p 6432 \
-d metastore \
-U hive \
-W
\d to list the tables
\q to exit
########################
### Hive Set up ###
########################
### Download Hive.
wget https://mirrors.ocf.berkeley.edu/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
### Untar the Hive file.
tar xzf apache-hive-3.1.2-bin.tar.gz
### Archive the Hive tar file.
mv apache-hive-3.1.2-bin.tar.gz softwares
### Set up the Hive folder structure.
sudo mv -f apache-hive-3.1.2-bin /opt
### Create a soft link.
sudo ln -s /opt/apache-hive-3.1.2-bin /opt/hive
### Update HIVE_HOME in the .profile file.
cd
export HIVE_HOME=/opt/hive
export PATH=$PATH:${HIVE_HOME}/bin
### Source the .profile or exit the session and connect again.
source .profile
### hive-site.xml : Global configuration file for Hive.
/opt/hive/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:6432/metastore</value>
<description>JDBC driver connection for PostgreSQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>PostgreSQL metastore driver class name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Database User Name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>shorya</value>
<description>Database User Password</description>
</property>
</configuration>
### Remove the conflicting Guava Files if present.
rm /opt/hive/lib/guava-19.0.jar
cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib/
### Download a postgresql jar file and copy it to /opt/hive/lib/
wget https://jdbc.postgresql.org/download/postgresql-42.2.24.jar
sudo mv postgresql-42.2.24.jar /opt/hive/lib/postgresql-42.2.24.jar
### Initialize the Hive Metastore.
schematool -dbType postgres -initSchema
### Validate the Metadata Tables.
docker exec \
-it postgress_container \
psql -U postgres \
-d metastore
\d
\q
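The same validation can be done from Python with the psycopg2 driver. This is an optional sketch and assumes you first install the driver with pip3 install psycopg2-binary (not part of the steps above):
# check_metastore.py - list the metastore tables created by schematool (needs psycopg2-binary)
import psycopg2

conn = psycopg2.connect(host="localhost", port=6432, dbname="metastore",
                        user="hive", password="shorya")
with conn, conn.cursor() as cur:
    cur.execute("SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = 'public' ORDER BY table_name")
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
You should see Hive's internal tables such as DBS and TBLS in the output.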
Setup Spark 2.x/3.x
### Update PYSPARK_PYTHON at .profile
export PYSPARK_PYTHON=python3
source .profile
### Update /opt/hive/conf/hive-site.xml with the below property.
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
### Spark website to download from.
https://downloads.apache.org/spark/
### Download Spark
#2.x
wget https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
#3.x
wget https://ftp.wayne.edu/apache/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
Note: it is possible that you may not find the above link. In that case go to https://ftp.wayne.edu/apache/spark and choose an appropriate link.
### Untar the File
#2.x
tar xzf spark-2.4.8-bin-hadoop2.7.tgz
#3.x
tar xzf spark-3.2.1-bin-hadoop3.2.tgz
### Archive the tar file
#2.x
mv spark-2.4.8-bin-hadoop2.7.tgz softwares
#3.x
mv spark-3.2.1-bin-hadoop3.2.tgz softwares
### Set up Spark Folder Structure
#2.x
sudo mv -f spark-2.4.8-bin-hadoop2.7 /opt
#3.x
sudo mv -f spark-3.2.1-bin-hadoop3.2 /opt
### Set up Soft Link
#2.x
sudo ln -s /opt/spark-2.4.8-bin-hadoop2.7 /opt/spark2
#3.x
sudo ln -s /opt/spark-3.2.1-bin-hadoop3.2 /opt/spark3
### spark-env.sh
#2.x
Update /opt/spark2/conf/spark-env.sh with the below environment variables.
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
export SPARK_DIST_CLASSPATH=$(hadoop --config ${HADOOP_CONF_DIR} classpath)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
#3.x
Update /opt/spark3/conf/spark-env.sh with the below environment variables.
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
### spark-defaults.conf
#2.x
Update /opt/spark2/conf/spark-defaults.conf with below properties.
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled true
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark2-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs:///spark2-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18081
spark.yarn.historyServer.address localhost:18081
spark.yarn.jars hdfs:///spark2-jars/*.jar
#3.x
Update /opt/spark3/conf/spark-defaults.conf with below properties.
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled true
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark3-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs:///spark3-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.yarn.historyServer.address localhost:18080
spark.yarn.jars hdfs:///spark3-jars/*.jar
### Create directories for logs and jars in HDFS.
#2.x
hdfs dfs -mkdir /spark2-jars
hdfs dfs -mkdir /spark2-logs
#3.x
hdfs dfs -mkdir /spark3-jars
hdfs dfs -mkdir /spark3-logs
### Copy the Spark jars to the HDFS folder referenced by spark.yarn.jars.
#2.x
hdfs dfs -put /opt/spark2/jars/* /spark2-jars
#3.x
hdfs dfs -put /opt/spark3/jars/* /spark3-jars
### Integrate Spark with the Hive Metastore. Create a soft link for hive-site.xml in the Spark conf folder.
#2.x
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark2/conf/
#3.x
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark3/conf/
### Install the Postgres JDBC jar in the Spark jars folder.
#2.x
sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.19.jar \
-O /opt/spark2/jars/postgresql-42.2.19.jar
#3.x
sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.19.jar \
-O /opt/spark3/jars/postgresql-42.2.19.jar
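With the driver in place, Spark can read from the dockerised Postgres directly over JDBC. The snippet below is only an illustration to show why the jar is needed; run it inside pyspark after Spark is validated in the next step (the port and credentials are the ones used earlier in this tutorial):
# read a Postgres catalog view through the JDBC driver we just installed
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the shell's session when run inside pyspark
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:6432/metastore")
      .option("dbtable", "information_schema.tables")
      .option("user", "hive")
      .option("password", "shorya")
      .load())
df.select("table_name").show(5)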
### Validate Spark using Scala
#2.x
/opt/spark2/bin/spark-shell --master yarn --conf spark.ui.port=0
#3.x
/opt/spark3/bin/spark-shell --master yarn --conf spark.ui.port=0
### Validate Spark using Python
#2.x
/opt/spark2/bin/pyspark --master yarn --conf spark.ui.port=0
#3.x
/opt/spark3/bin/pyspark --master yarn --conf spark.ui.port=0
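The spark.sql checks below assume a Hive database named test containing a table named spark. Neither exists on a fresh metastore, so here is a minimal sketch (the names are only examples chosen to match the checks below) that you can paste into the pyspark shell first:
# create a sample Hive database and table for the checks that follow
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("CREATE TABLE IF NOT EXISTS test.spark (id INT, name STRING) USING hive")
spark.sql("INSERT INTO test.spark VALUES (1, 'hello'), (2, 'world')")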
spark.sql('SHOW databases').show()
spark.sql('USE test')
spark.sql('SELECT * FROM spark').show()
exit()
### All the commands (spark-shell, pyspark, spark-submit) are available in:
/opt/spark2/bin
/opt/spark3/bin
### Set these command paths in the .profile by adding the below two lines.
#2.x
export PATH=$PATH:/opt/spark2/bin
#3.x
export PATH=$PATH:/opt/spark3/bin
source .profile
### Distinguish these commands between Spark 2 and Spark 3.
#2.x
mv /opt/spark2/bin/pyspark /opt/spark2/bin/pyspark2
#3.x
mv /opt/spark3/bin/pyspark /opt/spark3/bin/pyspark3
### Validate the commands
#2.x
pyspark2 --master yarn
#3.x
pyspark3 --master yarn
############ END ############
### Fix the Python warning (/opt/spark2/bin/pyspark2: line 45: python: command not found)
#2.x
cd /opt/spark2/bin
vi pyspark2
Edit the line
WORKS_WITH_IPYTHON=$(python -c 'import sys; print(sys.version_info >= (2, 7, 0))')
To
WORKS_WITH_IPYTHON=$(python3 -c 'import sys; print(sys.version_info >= (2, 7, 0))')
#3.x
No such error occurs in PySpark 3.x.
### Set up the spark-submit files. We can use the spark-submit command to launch applications or jobs on the cluster.
#2.x
cp /opt/spark2/bin/spark-submit /opt/spark2/bin/spark2-submit
-- Create a Spark Job
cd
vi basic.py
print("Start ...")
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()
spark.sparkContext.setLogLevel('OFF')
print("Spark Object is created")
print("Spark Version used is: " + spark.sparkContext.version)
print("... End")
-- Submit the Job to the Cluster
spark2-submit --master yarn /home/${USER}/basic.py
-- Turn off the INFO logs. Spark uses log4j for logging.
cp /opt/spark2/conf/log4j.properties.template /opt/spark2/conf/log4j.properties
set log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
spark2-submit --master yarn /home/${USER}/basic.py
### Resolve "Class path contains multiple SLF4J bindings"
mv /opt/spark-2.4.8-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar /home/${USER}/softwares
spark2-submit --master yarn /home/${USER}/basic.py
#3.x
cp /opt/spark3/bin/spark-submit /opt/spark3/bin/spark3-submit
-- Submit the Job to the Cluster
spark3-submit --master yarn /home/${USER}/basic.py
-- Turn off the INFO logs
cp /opt/spark3/conf/log4j.properties.template /opt/spark3/conf/log4j.properties
set log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
spark3-submit --master yarn /home/${USER}/basic.py
################ END ########################