SPARK RDDs

Shorya Sharma
Dec 31, 2021

In this article we will go through the concept of RDDs and some of the important transformations and actions on RDDs using PySpark. RDDs are one of the most important concepts to grasp if you are starting out with the Spark framework.

source: https://cdn.analyticsvidhya.com/wp-content/uploads/2020/10/Screenshot-from-2020-10-26-16-33-40.png

This is a follow-along tutorial and is useful for anyone who wants to learn about RDDs. I have used the community edition of Databricks for implementing the code, but you may use a local PySpark installation as well.

What are RDDs?

RDD, which stands for Resilient Distributed Dataset, is Spark's core abstraction. An RDD is an immutable collection of objects; internally, Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelization.

source : https://static.javatpoint.com/tutorial/pyspark/images/pyspark-rdd2.png

Transformations and Actions

▪Transformations create a new RDD from an existing one.

▪Actions return a value to the driver program after running a computation on the RDD

▪All transformations in Spark are lazy: they are not computed until a result is actually needed

▪Spark only triggers the data flow when an action is called (see the sketch below)

source: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ9BgMn7YQySzuWtnbuRCBQjgxBK8iobC4BpQ&usqp=CAU
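
A minimal sketch of this laziness, assuming sc is the SparkContext (available by default in a Databricks notebook):

# Transformations only build up a plan; nothing runs until an action is called
nums = sc.parallelize([1, 2, 3, 4, 5])
squared = nums.map(lambda x: x * x)          # transformation: evaluated lazily
total = squared.reduce(lambda x, y: x + y)   # action: triggers the computation
print(total)                                 # 55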

CREATING AN RDD

1. Create a sample.txt file

2. Read it using Spark, as sketched below
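
A minimal sketch of these two steps on Databricks; the DBFS path and the file contents are purely illustrative assumptions:

# 1. Create a small sample.txt file (path and contents are illustrative)
dbutils.fs.put("/FileStore/sample.txt", "hello spark\nhello rdd\nspark is fast", True)

# 2. Read it into an RDD, one element per line
rdd = sc.textFile("/FileStore/sample.txt")
rdd.collect()   # ['hello spark', 'hello rdd', 'spark is fast']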

map()

▪Map is used as a mapper of data from one state to another, applying a function to every element of the RDD

▪It will create a new RDD

▪rdd.map(lambda x: x.split())

map using lambda

map using simple functions
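
A small sketch of both variants, assuming the rdd of lines created from sample.txt above:

# map using a lambda: split every line into a list of words
rdd.map(lambda x: x.split()).collect()
# [['hello', 'spark'], ['hello', 'rdd'], ['spark', 'is', 'fast']]

# map using a simple (named) function
def to_upper(line):
    return line.upper()

rdd.map(to_upper).collect()
# ['HELLO SPARK', 'HELLO RDD', 'SPARK IS FAST']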

flatMap()

▪FlatMap is used as a mapper of data and flattens (explodes) each element into multiple output elements before the final output

▪It will create a new RDD

▪rdd.flatMap(lambda x: x.split())
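
Contrast this with map() above, again assuming the same rdd of lines:

# flatMap flattens the per-line word lists into a single RDD of words
words = rdd.flatMap(lambda x: x.split())
words.collect()   # ['hello', 'spark', 'hello', 'rdd', 'spark', 'is', 'fast']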

filter()

▪Filter is used to keep only the elements that satisfy a condition and remove the rest from the RDD

▪It will create a new RDD

▪rdd.filter(lambda x: x != 123)
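
A minimal sketch with a throwaway RDD of numbers (the values are arbitrary):

nums = sc.parallelize([10, 123, 25, 123, 7])
nums.filter(lambda x: x != 123).collect()   # [10, 25, 7]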

distinct()

▪Distinct is used to get the unique elements in an RDD

▪It will create a new RDD

▪rdd.distinct()
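
For example, on the words RDD built with flatMap above:

words.distinct().collect()
# ['hello', 'spark', 'rdd', 'is', 'fast']   (order may vary)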

groupByKey()

▪GroupByKey is used to create groups based on keys in an RDD

▪For groupByKey to work properly the data must be key-value pairs, e.g. (k1, v1), (k1, v2), (k2, v1), (k2, v2)

▪Example: (“Apple”,1), (“Ball”,1), (“Apple”,1)

▪It will create a new RDD

▪rdd.groupByKey()

▪mapValues(list) is usually used to materialize the grouped values as lists
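
A minimal sketch using the (fruit, 1) pairs from the example bullet:

pairs = sc.parallelize([("Apple", 1), ("Ball", 1), ("Apple", 1)])
pairs.groupByKey().mapValues(list).collect()
# [('Apple', [1, 1]), ('Ball', [1])]   (order may vary)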

reduceByKey()

▪ReduceByKey is used to combine values that share the same key in an RDD

▪For reduceByKey to work properly the data must be key-value pairs, e.g. (k1, v1), (k1, v2), (k2, v1), (k2, v2)

▪Example: (“Apple”,1), (“Ball”,1), (“Apple”,1)

▪It will create a new RDD

▪rdd.reduceByKey(lambda x, y: x + y)
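
The same pairs, this time summed per key:

pairs = sc.parallelize([("Apple", 1), ("Ball", 1), ("Apple", 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()
# [('Apple', 2), ('Ball', 1)]   (order may vary)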

count()

▪count returns the number of elements in the RDD

▪count is an action

▪rdd.count()
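
For instance, on the words RDD from the flatMap example:

words.count()   # 7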

countByValue()

▪countByValue returns how many times each value occurs in the RDD

▪countByValue is an action

▪rdd.countByValue()
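
Again on the words RDD; note that the result comes back to the driver as a Python dict-like object, not as a new RDD:

words.countByValue()
# {'hello': 2, 'spark': 2, 'rdd': 1, 'is': 1, 'fast': 1}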

saveAsTextFile()

▪saveAsTextFile is used to save the RDD to a text file (the given path is created as a directory of part files)

▪saveAsTextFile is an action

▪rdd.saveAsTextFile('path/to/file/filename.txt')
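
A minimal sketch with a purely illustrative DBFS path:

# Spark creates a directory at this path containing part-00000, part-00001, ...
words.saveAsTextFile("/FileStore/output/words")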

repartition()

▪Repartition is used to change the number of partitions of an RDD

▪It will create a new RDD

▪rdd.repartition(number_of_partitions)
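
For example, checking the partition count before and after (4 is an arbitrary choice):

words.getNumPartitions()           # current number of partitions
repartitioned = words.repartition(4)
repartitioned.getNumPartitions()   # 4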
