SPARK RDDs

shorya sharma
4 min readDec 31, 2021

--

In this article we will go through the concept of RDDs and some of the important transformations and actions on RDDs using Pyspark. RDDs are one the most important concepts to grab if someone is starting up with spark framework.

source: https://cdn.analyticsvidhya.com/wp-content/uploads/2020/10/Screenshot-from-2020-10-26-16-33-40.png

This is a follow along tutorial and is useful for anyone who wants to learn about RDDs. I have used community version of Databricks for implementing the code you may use local PySpark also.

What are RDDs ?

RDDs are the sparks core abstractions which stands for Resilient distributed datasets. RDD is the immutable collections of objects, Internally spark distributes the data in RDD, to different nodes across the cluster to achieve parallelization.

source : https://static.javatpoint.com/tutorial/pyspark/images/pyspark-rdd2.png

Transformations and Actions

▪Transformations create a new RDD from an existing one.

▪Actions return a value to the driver program after running a computation on the RDD

▪All transformations in Spark are lazy

▪Spark only triggers the data flow when there’s a action

source: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ9BgMn7YQySzuWtnbuRCBQjgxBK8iobC4BpQ&usqp=CAU

CREATING A RDD

  1. create a sample.txt file

2. read it using spark

map()

▪Map is used as a mapper of data from one state to other

▪It will create a new RDD

▪rdd.map(lambda x: x.split())

map using lambda

map using simple functions

flatMap()

▪Flat Map is used as a maper of data and explodes data before final output

▪It will create a new RDD

▪rdd.flatMap(lambda x: x.split())

filter()

▪Filter is used to remove the elements from the RDD

▪It will create a new RDD

▪rdd.filter(lambda x: x != 123)

distinct()

▪Distinct is used to get the distinct elements in RDD

▪It will create a new RDD

▪rdd.distinct()

groupByKey()

▪GroupByKey is used to create groups based on Keys in RDD

▪For groupByKey to work properly the data must be in the format of (k,v), (k,v), (k2,v), (k2,v2)

▪Example: (“Apple”,1), (“Ball”,1), (“Apple”,1)

▪It will create a new RDD

▪rdd.groupByKey()

▪mapValues(list) are usually used to get the group data

reduceByKey()

▪ReduceByKey is used to combined data based on Keys in RDD

▪For reduceByKey to work properly the data must be in the format of (k,v), (k,v), (k2,v), (k2,v2)

▪Example: (“Apple”,1), (“Ball”,1), (“Apple”,1)

▪It will create a new RDD

▪rdd.reduceByKey(lambda x, y: x + y)

count()

▪count returns the number of elements in RDD

▪count is an action

▪rdd.count()

countByValue()

▪CountByValue provide how many times each value occur in RDD

▪countByValue is an action

▪rdd.countByValue()

saveAsTextFile()

▪SaveAsTextFile is used to save the RDD in the file

▪saveAsTextFile is an action

▪rdd.saveAsTextFile(‘path/to/file/filename.txt’)

repartition()

▪Repartition is used to change the number of partitions in RDD

▪It will create a new RDD

▪rdd.repartition(number_of_partitions)

--

--

shorya sharma
shorya sharma

Written by shorya sharma

Assistant Manager at Bank Of America | Ex-Data Engineer at IBM | Ex - Software Engineer at Compunnel inc. | Python | Data Science

No responses yet