SPARK RDDs
In this article we will go through the concept of RDDs and some of the important transformations and actions on RDDs using PySpark. RDDs are one of the most important concepts to grasp if you are starting out with the Spark framework.
This is a follow-along tutorial and is useful for anyone who wants to learn about RDDs. I have used the Community Edition of Databricks to implement the code, but you can use local PySpark as well.
What are RDDs?
RDDs are Spark's core abstraction; the name stands for Resilient Distributed Datasets. An RDD is an immutable collection of objects; internally, Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelization.
Transformations and Actions
▪Transformations create a new RDD from an existing one.
▪Actions return a value to the driver program after running a computation on the RDD
▪All transformations in Spark are lazy
▪Spark only triggers the data flow when there is an action (see the sketch below)
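A minimal sketch of this laziness, using a small in-memory RDD (the sample numbers are made up; on Databricks the SparkContext sc is already available):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # on Databricks, sc already exists

rdd = sc.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)  # transformation: nothing runs yet
print(doubled.collect())            # action: computation happens here -> [2, 4, 6, 8]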
CREATING AN RDD
1. Create a sample.txt file
2. Read it using Spark, as sketched below
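A minimal sketch, assuming a small sample.txt sits in the working directory (the file name and its contents are placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# textFile reads the file into an RDD with one element per line
rdd = sc.textFile('sample.txt')
print(rdd.collect())  # e.g. ['hello spark', 'hello world']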
map()
▪Map is used as a mapper of data from one state to another
▪It will create a new RDD
▪rdd.map(lambda x: x.split())
Below is a minimal sketch of map using a lambda and using a simple named function (the sample data and the split_line helper are placeholders):
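from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(['hello spark', 'hello world'])

# map with a lambda: each line becomes a list of words
print(rdd.map(lambda x: x.split()).collect())
# [['hello', 'spark'], ['hello', 'world']]

# map with a simple named function: same result
def split_line(line):
    return line.split()

print(rdd.map(split_line).collect())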
flatMap()
▪flatMap is used as a mapper of data and flattens (explodes) the results before the final output, so one input element can produce several output elements
▪It will create a new RDD
▪rdd.flatMap(lambda x: x.split())
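A minimal sketch, using the same placeholder data as above; note how the word lists are flattened into a single RDD of words:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(['hello spark', 'hello world'])

# flatMap splits each line, then flattens the resulting lists
print(rdd.flatMap(lambda x: x.split()).collect())
# ['hello', 'spark', 'hello', 'world']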
filter()
▪Filter is used to remove elements from the RDD: it keeps only the elements for which the given predicate returns True
▪It will create a new RDD
▪rdd.filter(lambda x: x != 123)
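A minimal sketch with made-up numbers, keeping every element except 123:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1, 123, 2, 123, 3])

# filter keeps only the elements for which the lambda returns True
print(rdd.filter(lambda x: x != 123).collect())  # [1, 2, 3]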
distinct()
▪Distinct is used to get the distinct elements in the RDD
▪It will create a new RDD
▪rdd.distinct()
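A minimal sketch with made-up numbers:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1, 2, 2, 3, 3, 3])
print(rdd.distinct().collect())  # [1, 2, 3] (order may vary)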
groupByKey()
▪GroupByKey is used to create groups based on keys in the RDD
▪For groupByKey to work properly the data must be in the form of (key, value) pairs, e.g. (k1, v1), (k1, v2), (k2, v1)
▪Example: ("Apple", 1), ("Ball", 1), ("Apple", 1)
▪It will create a new RDD
▪rdd.groupByKey()
▪mapValues(list) is usually chained after groupByKey to get the grouped data in a readable form, as in the sketch below
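A minimal sketch using the (key, value) example above:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([('Apple', 1), ('Ball', 1), ('Apple', 1)])

# groupByKey groups the values per key; mapValues(list) makes the
# grouped values readable when collected
print(rdd.groupByKey().mapValues(list).collect())
# [('Apple', [1, 1]), ('Ball', [1])] (order may vary)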
reduceByKey()
▪ReduceByKey is used to combine data based on keys in the RDD
▪For reduceByKey to work properly the data must be in the form of (key, value) pairs, e.g. (k1, v1), (k1, v2), (k2, v1)
▪Example: ("Apple", 1), ("Ball", 1), ("Apple", 1)
▪It will create a new RDD
▪rdd.reduceByKey(lambda x, y: x + y)
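A minimal sketch using the same (key, value) example; the lambda adds up the values of each key:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([('Apple', 1), ('Ball', 1), ('Apple', 1)])

# reduceByKey merges the values of each key with the given function
print(rdd.reduceByKey(lambda x, y: x + y).collect())
# [('Apple', 2), ('Ball', 1)] (order may vary)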
count()
▪count returns the number of elements in the RDD
▪count is an action
▪rdd.count()
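A minimal sketch with made-up data:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1, 2, 2, 3])
print(rdd.count())  # 4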
countByValue()
▪countByValue returns how many times each value occurs in the RDD
▪countByValue is an action
▪rdd.countByValue()
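A minimal sketch with the same made-up data; countByValue brings a dict-like object back to the driver:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1, 2, 2, 3])
print(rdd.countByValue())  # defaultdict(<class 'int'>, {1: 1, 2: 2, 3: 1})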
saveAsTextFile()
▪saveAsTextFile is used to save the RDD to a text file; the path you pass is an output directory, and Spark writes one part file per partition inside it
▪saveAsTextFile is an action
▪rdd.saveAsTextFile('path/to/file/filename.txt')
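A minimal sketch; the output path is a placeholder, and remember that it names a directory, not a single file:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(['hello spark', 'hello world'])

# Spark creates the directory and writes one part file per partition
rdd.saveAsTextFile('path/to/output')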
repartition()
▪Repartition is used to change the number of partitions in the RDD (it performs a full shuffle of the data)
▪It will create a new RDD
▪rdd.repartition(number_of_partitions)
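A minimal sketch with made-up data, checking the partition count before and after:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(10), 2)       # start with 2 partitions
print(rdd.getNumPartitions())            # 2

repartitioned = rdd.repartition(4)       # returns a new RDD
print(repartitioned.getNumPartitions())  # 4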
I hope you were able to grasp the concept and some practical knowledge of RDDs.
There are other articles that I have written which might come in handy if you want to know more about Spark.