Single Node Cluster Setup with Spark 2.x/3.x
In this tutorial we will go through how to set up a single-node cluster on an Ubuntu machine in the cloud.
We will use a Google Cloud virtual machine running Ubuntu 18.04 to set up our single-node cluster; you may use AWS or Azure instead.
We will also set up a local Hive metastore.
Go to Google Cloud Platform, register and start your free trial. GCP provides $300 in free credits.
Once the signup is successful you will be redirected to the GCP console.
Click on Compute Engine in the left pane and start a VM instance with Ubuntu 18.04 as the OS.
Now follow the below steps once you have started your VM instance.
Setup Python and Java
### Update existing list of packages
sudo apt update
### Install pip
sudo apt install python3-pip
### Install venv
sudo apt install python3-venv
### Validate venv
python3 -m venv tutorial-env
ls -ltr
rm -rf tutorial-env
### Install Java JDK
sudo apt-get install openjdk-8-jdk
### Validate Java
java -version
javac -version
Setup Passwordless SSH to localhost
### Check if SSH is installed
ssh
### Generate the private and public keys using ssh-keygen
ssh-keygen
### Copy the content of public key to authorized key file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
### Test ssh localhost
ssh localhost
### Exit the SSH session
exit
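Optionally, we can confirm that the passwordless SSH we just set up works non-interactively. The small Python sketch below is my own helper (not part of the standard setup); it simply wraps the ssh command with BatchMode so it fails instead of prompting for a password.
# check_ssh.py - verify passwordless SSH to localhost; BatchMode=yes fails instead of prompting
import subprocess

result = subprocess.run(
    ["ssh", "-o", "BatchMode=yes", "localhost", "echo", "ssh-ok"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
if result.returncode == 0 and "ssh-ok" in result.stdout:
    print("Passwordless SSH to localhost is working.")
else:
    print("SSH check failed:", result.stderr.strip())
Save it as check_ssh.py and run it with python3 check_ssh.py.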
Setup Hadoop, HDFS, YARN and manage the cluster services
### Download the Hadoop tar (latest as of this writing).
wget https://mirrors.gigenet.com/apache/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
### Untar the File.
tar xfz hadoop-3.3.1.tar.gz
### Archive all the downloaded software.
mkdir softwares
mv hadoop-3.3.1.tar.gz softwares
### Set up folder structure.
sudo mv -f hadoop-3.3.1 /opt
### Change the ownership to the current user.
sudo chown ${USER}:${USER} -R /opt/hadoop-3.3.1
### Create a soft link as /opt/hadoop.
sudo ln -s /opt/hadoop-3.3.1 /opt/hadoop
### Add the below three lines to the .profile file.
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
### Source the .profile or exit the Session.
source .profile
### Validate if the changes are reflected.
echo $JAVA_HOME
echo $HADOOP_HOME
echo $PATH
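If you prefer a scripted check, the optional Python sketch below (my own helper, nothing Hadoop-specific) verifies that the variables from .profile are visible to newly started processes:
# check_env.py - confirm the Java/Hadoop environment variables are set for new processes
import os

for var in ("JAVA_HOME", "HADOOP_HOME", "PATH"):
    value = os.environ.get(var, "")
    print(var, "=", value)
    if not value:
        print("WARNING:", var, "is not set; re-check your .profile")
Run it with python3 check_env.py after sourcing .profile.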
### core-site.xml : Informs Hadoop where NameNode runs in the cluster.
/opt/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
### hdfs-site.xml : Contains the configuration settings for HDFS daemons like the NameNode, the Secondary NameNode, and the DataNodes.
/opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/opt/hadoop/dfs/namesecondary</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
</configuration>
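As a quick sanity check of the two site files we just edited, the optional Python sketch below (it only parses the XML written above, nothing else is assumed) prints every property Hadoop will pick up:
# check_conf.py - print the properties configured in core-site.xml and hdfs-site.xml
import xml.etree.ElementTree as ET

for conf in ("/opt/hadoop/etc/hadoop/core-site.xml",
             "/opt/hadoop/etc/hadoop/hdfs-site.xml"):
    print("---", conf, "---")
    root = ET.parse(conf).getroot()
    for prop in root.findall("property"):
        print(prop.findtext("name"), "=", prop.findtext("value"))
You should see fs.defaultFS, the dfs.* directories, the replication factor and the block size listed.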
### hadoop-env.sh : Contains environment variable settings used by Hadoop.
/opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
### Format Namenode
hdfs namenode -format
ls -ltr /opt/hadoop/dfs/
### Start HDFS components.
start-dfs.sh
### Validate that the HDFS services are running.
jps
### Test HDFS Commands.
hadoop fs -ls /
hadoop fs -mkdir -p /user/${USER}
hadoop fs -ls /user/${USER}
echo "hello hdfs" > test.txt
hadoop fs -copyFromLocal test.txt /user/${USER}
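The same round trip can also be done from Python. This is an optional sketch that just shells out to the hadoop CLI used above (no extra libraries assumed):
# hdfs_roundtrip.py - read back the test.txt we just copied into HDFS
import getpass
import subprocess

user = getpass.getuser()
content = subprocess.check_output(["hadoop", "fs", "-cat", "/user/{}/test.txt".format(user)])
print("HDFS says:", content.decode().strip())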
### yarn-site.xml : This file contains the YARN configuration options.
/opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,JAVA_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
### mapred-site.xml : MapReduce configuration options are stored in this file.
/opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
### Start YARN Components.
start-yarn.sh
### Validate YARN Services are running.
jps
### All these start and stop scripts are available in:
/opt/hadoop/sbin
### Follow this order to stop the services.
1. stop-yarn.sh
2. stop-dfs.sh
3. Stop the VM instance
### Follow this order to start the services (a small status-check sketch follows below).
1. Start the VM instance
2. start-dfs.sh
3. start-yarn.sh
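As mentioned above, here is a small status-check sketch (my own helper built around jps, not part of Hadoop itself) that reports which daemons are up after a restart:
# check_daemons.py - report which HDFS/YARN daemons appear in the jps output
import subprocess

EXPECTED = ["NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"]

jps_output = subprocess.check_output(["jps"]).decode()
running = {parts[1] for parts in (line.split() for line in jps_output.splitlines()) if len(parts) > 1}

for daemon in EXPECTED:
    print(daemon, "is", "running" if daemon in running else "NOT running")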
Setup Docker, PostgreSQL, Hive and the Metastore
#####################
## Docker Set up ##
#####################
### Update the existing list of packages.
sudo apt update
### Install a few prerequisite packages.
sudo apt install apt-transport-https ca-certificates curl software-properties-common
### Add the GPG key for the official Docker repository to our host.
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
### Add the Docker repository to APT sources.
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
### Update the package database with the Docker packages from the newly added repo.
sudo apt update
### Check the docker-ce installation candidate.
apt-cache policy docker-ce
### Install Docker.
sudo apt install docker-ce
### Validate the Docker installation.
sudo systemctl status docker
### To run docker commands without sudo, add the user to the docker group.
sudo usermod -aG docker ${USER}
### Log out and log in again, then validate that the user was added to the docker group.
id ${USER}
docker images
### To check whether you can access and download images from Docker Hub, type:
docker run hello-world
########################
## PostgreSQL Set up ##
########################
### Now, let's set up a Postgres image in a Docker container.
### Create the container from the Postgres image.
docker create \
--name postgress_container \
-p 6432:5432 \
-e POSTGRES_PASSWORD=shorya \
postgres
### Start the Container.
docker start postgress_container
### Check if the container is running (hit Ctrl+C to come out of the logs).
docker logs -f postgress_container
### Validate that we are able to run psql from the container.
docker exec \
-it postgress_container \
psql -U postgres
### Create a database "metastore" for Hive in Postgres.
CREATE DATABASE metastore;
CREATE USER hive WITH ENCRYPTED PASSWORD 'shorya';
GRANT ALL ON DATABASE metastore TO hive;
\l to list the databases
\q to exit psql
### If you want to access Postgres from the host, install a Postgres client.
sudo apt install postgresql-client -y
psql -h localhost \
-p 6432 \
-d metastore \
-U hive \
-W
\d to list the tables
\q to exit
########################
### Hive Set up ###
########################
### Download Hive.
wget https://mirrors.ocf.berkeley.edu/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
### Untar the Hive file.
tar xzf apache-hive-3.1.2-bin.tar.gz
### Archive the Hive tar file.
mv apache-hive-3.1.2-bin.tar.gz softwares
### Set up the Hive folder structure.
sudo mv -f apache-hive-3.1.2-bin /opt
### Create a soft link.
sudo ln -s /opt/apache-hive-3.1.2-bin /opt/hive
### Update HIVE_HOME in the .profile file.
cd
export HIVE_HOME=/opt/hive
export PATH=$PATH:${HIVE_HOME}/bin
### Source the .profile or exit the session and connect again.
source .profile
### hive-site.xml : Global configuration file for Hive.
/opt/hive/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:6432/metastore</value>
<description>JDBC driver connection for PostgreSQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>PostgreSQL metastore driver class name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Database User Name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>shorya</value>
<description>Database User Password</description>
</property>
</configuration>
### Remove the conflicting Guava Files if present.
rm /opt/hive/lib/guava-19.0.jar
cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib/
### Download a postgresql jar file and copy it to /opt/hive/lib/
wget https://jdbc.postgresql.org/download/postgresql-42.2.24.jar
sudo mv postgresql-42.2.24.jar /opt/hive/lib/postgresql-42.2.24.jar
### Initialize the Hive Metastore.
schematool -dbType postgres -initSchema
### Validate the Metadata Tables.
docker exec \
-it postgress_container \
psql -U postgres \
-d metastore
\d
\q
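The same validation can be done from Python with the psycopg2 driver. This is an optional sketch and assumes you first install the driver with pip3 install psycopg2-binary (not part of the steps above):
# check_metastore.py - list the metastore tables created by schematool (needs psycopg2-binary)
import psycopg2

conn = psycopg2.connect(host="localhost", port=6432, dbname="metastore",
                        user="hive", password="shorya")
with conn, conn.cursor() as cur:
    cur.execute("SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = 'public' ORDER BY table_name")
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
You should see Hive's internal tables such as DBS and TBLS in the output.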
Setup Spark 2.x/3.x
### Update PYSPARK_PYTHON at .profile
export PYSPARK_PYTHON=python3
source .profile
### Update /opt/hive/conf/hive-site.xml with the below property.
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
### Spark website to download from.
https://downloads.apache.org/spark/
### Download Spark
#2.x
wget https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
#3.x
wget https://ftp.wayne.edu/apache/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
Note: it is possible that you may not find the above link. In that case go to https://ftp.wayne.edu/apache/spark and choose an appropriate link.
### Untar the File
#2.x
tar xzf spark-2.4.8-bin-hadoop2.7.tgz
#3.x
tar xzf spark-3.2.1-bin-hadoop3.2.tgz
### Archive the tar file
#2.x
mv spark-2.4.8-bin-hadoop2.7.tgz softwares
#3.x
mv spark-3.2.1-bin-hadoop3.2.tgz softwares
### Set up Spark Folder Structure
#2.x
sudo mv -f spark-2.4.8-bin-hadoop2.7 /opt
#3.x
sudo mv -f spark-3.2.1-bin-hadoop3.2 /opt
### Set up Soft Link
#2.x
sudo ln -s /opt/spark-2.4.8-bin-hadoop2.7 /opt/spark2
#3.x
sudo ln -s /opt/spark-3.2.1-bin-hadoop3.2 /opt/spark3
### spark-env.sh
#2.x
Update /opt/spark2/conf/spark-env.sh with the below environment variables.
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
export SPARK_DIST_CLASSPATH=$(hadoop --config ${HADOOP_CONF_DIR} classpath)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
#3.x
Update /opt/spark3/conf/spark-env.sh with the below environment variables.
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
### spark-defaults.conf
#2.x
Update /opt/spark2/conf/spark-defaults.conf with below properties.
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled true
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark2-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs:///spark2-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18081
spark.yarn.historyServer.address localhost:18081
spark.yarn.jars hdfs:///spark2-jars/*.jar
#3.x
Update /opt/spark3/conf/spark-defaults.conf with below properties.
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled true
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark3-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs:///spark3-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.yarn.historyServer.address localhost:18080
spark.yarn.jars hdfs:///spark3-jars/*.jar
### Create directories for logs and jars in HDFS.
#2.x
hdfs dfs -mkdir /spark2-jars
hdfs dfs -mkdir /spark2-logs
#3.x
hdfs dfs -mkdir /spark3-jars
hdfs dfs -mkdir /spark3-logs
### Copy the Spark jars to the HDFS folder referenced by spark.yarn.jars.
#2.x
hdfs dfs -put /opt/spark2/jars/* /spark2-jars
#3.x
hdfs dfs -put /opt/spark3/jars/* /spark3-jars
### Integrate Spark with the Hive Metastore. Create a soft link for hive-site.xml in the Spark conf folder.
#2.x
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark2/conf/
#3.x
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark3/conf/
### Install the Postgres JDBC jar in the Spark jars folder.
#2.x
sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.19.jar \
-O /opt/spark2/jars/postgresql-42.2.19.jar
#3.x
sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.19.jar \
-O /opt/spark3/jars/postgresql-42.2.19.jar
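With the driver in place, Spark can read from the dockerised Postgres directly over JDBC. The snippet below is only an illustration to show why the jar is needed; run it inside pyspark after Spark is validated in the next step (the port and credentials are the ones used earlier in this tutorial):
# read a Postgres catalog view through the JDBC driver we just installed
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the shell's session when run inside pyspark
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:6432/metastore")
      .option("dbtable", "information_schema.tables")
      .option("user", "hive")
      .option("password", "shorya")
      .load())
df.select("table_name").show(5)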
### Validate Spark using Scala
#2.x
/opt/spark2/bin/spark-shell --master yarn --conf spark.ui.port=0
#3.x
/opt/spark3/bin/spark-shell --master yarn --conf spark.ui.port=0
### Validate Spark using Python
#2.x
/opt/spark2/bin/pyspark --master yarn --conf spark.ui.port=0
#3.x
/opt/spark3/bin/pyspark --master yarn --conf spark.ui.port=0
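The spark.sql checks below assume a Hive database named test containing a table named spark. Neither exists on a fresh metastore, so here is a minimal sketch (the names are only examples chosen to match the checks below) that you can paste into the pyspark shell first:
# create a sample Hive database and table for the checks that follow
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("CREATE TABLE IF NOT EXISTS test.spark (id INT, name STRING) USING hive")
spark.sql("INSERT INTO test.spark VALUES (1, 'hello'), (2, 'world')")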
spark.sql('SHOW databases').show()
spark.sql('USE test')
spark.sql('SELECT * FROM spark').show()
exit()
### All the commands (spark-shell, pyspark, spark-submit) are available in:
/opt/spark2/bin
/opt/spark3/bin
### Set these command paths in the .profile by adding the below two lines.
#2.x
export PATH=$PATH:/opt/spark2/bin
#3.x
export PATH=$PATH:/opt/spark3/bin
source .profile
### Distinguish these commands between Spark 2 and Spark 3.
#2.x
mv /opt/spark2/bin/pyspark /opt/spark2/bin/pyspark2
#3.x
mv /opt/spark3/bin/pyspark /opt/spark3/bin/pyspark3
### Validate the commands
#2.x
pyspark2 --master yarn
#3.x
pyspark3 --master yarn
############ END ############
### Fix the Python warning (/opt/spark2/bin/pyspark2: line 45: python: command not found)
#2.x
cd /opt/spark2/bin
vi pyspark2
Edit the line
WORKS_WITH_IPYTHON=$(python -c 'import sys; print(sys.version_info >= (2, 7, 0))')
To
WORKS_WITH_IPYTHON=$(python3 -c 'import sys; print(sys.version_info >= (2, 7, 0))')
#3.x
No such error occurs in PySpark 3.x.
### Set up the spark-submit files. We can use the spark-submit command to launch applications or jobs on the cluster.
#2.x
cp /opt/spark2/bin/spark-submit /opt/spark2/bin/spark2-submit
-- Create a Spark Job
cd
vi basic.py
print("Start ...")
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()
spark.sparkContext.setLogLevel('OFF')
print("Spark Object is created")
print("Spark Version used is: " + spark.sparkContext.version)
print("... End")
-- Submit the Job to the Cluster
spark2-submit --master yarn /home/${USER}/basic.py
-- Turn off the INFO logs. Spark uses log4j for logging.
cp /opt/spark2/conf/log4j.properties.template /opt/spark2/conf/log4j.properties
set log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
spark2-submit --master yarn /home/${USER}/basic.py
### Resolve "Class path contains multiple SLF4J bindings"
mv /opt/spark-2.4.8-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar /home/${USER}/softwares
spark2-submit --master yarn /home/${USER}/basic.py
#3.x
cp /opt/spark3/bin/spark-submit /opt/spark3/bin/spark3-submit
-- Submit the Job to the Cluster
spark3-submit --master yarn /home/${USER}/basic.py
-- Turn off the INFO logs
cp /opt/spark3/conf/log4j.properties.template /opt/spark3/conf/log4j.properties
set log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
spark3-submit --master yarn /home/${USER}/basic.py
################ END ########################