Difference between groupByKey and reduceByKey in Spark

groupByKey and reduceByKey are two operations that transform an RDD (Resilient Distributed Dataset). What is the difference between them, and why does it matter?

In Spark, groupByKey and reduceByKey are both transformations used to group the elements of an RDD by key. groupByKey collects all values with the same key into a single (key, iterable) pair, while reduceByKey applies a reduction function to the values for each key, producing a new RDD with one reduced value per key.

Introduction

Spark is a distributed computing framework that uses in-memory computation to process large-scale data quickly. One of its key features is the ability to work with distributed collections of data called RDDs (Resilient Distributed Datasets).

RDDs can be transformed into new RDDs through operations like map, filter, groupByKey, reduceByKey, and many others. Read on to learn more about groupByKey and reduceByKey, two important transformations that operate on key-value pairs in an RDD.

GroupByKey

groupByKey is a transformation in Spark that groups the values for each key in a (key, value) RDD and returns a new (key, iterable) RDD, where each element is a key and the collection of its values. This can be useful when you want to perform some operation on all the values for a given key, such as computing the average of a set of values.

Let’s take an example where we have an RDD of subject scores. The RDD contains tuples where the key is the name of a subject and the value is a score obtained in that subject. We can use groupByKey to group the scores by subject:

scores = [("math", 80), ("science", 90), ("math", 70), ("science", 85), ("english", 75)]

# sc is the SparkContext that the PySpark shell provides
rdd = sc.parallelize(scores)

# groupByKey returns (key, iterable) pairs; converting each iterable to a list
# makes the grouped values readable
grouped_scores = rdd.groupByKey().mapValues(list)

grouped_scores.toDF().show()

Output:

+-------+--------+
|     _1|      _2|
+-------+--------+
|   math|[70, 80]|
|science|[90, 85]|
|english|    [75]|
+-------+--------+
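
Building on the grouped RDD above, here is a minimal sketch of the averaging use case mentioned earlier: once the values for each subject are collected into a list, a mapValues pass computes the mean (avg_scores is just an illustrative name).

# compute the average score per subject from the grouped values
avg_scores = grouped_scores.mapValues(lambda vals: sum(vals) / len(vals))

print(avg_scores.collect())
# e.g. [('math', 75.0), ('science', 87.5), ('english', 75.0)]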

ReduceByKey

reduceByKey is a transformation in Spark that groups the values for each key in the RDD and applies a specified reduce function to the values for each key, returning a new RDD where each element is a key and its corresponding reduced value. This can be more efficient than groupByKey because it performs the reduce operation on the values before shuffling the data across the network, potentially reducing the amount of data transferred.

Continuing with our scores example, we can use reduceByKey to compute the total score for each subject:

scores = [("math", 80), ("science", 90), ("math", 70), ("science", 85), ("english", 75)] 

rdd = sc.parallelize(scores) 

# apply the reduce function pairwise to the values of each key
total_scores = rdd.reduceByKey(lambda x, y: x + y)

total_scores.toDF().show()

Output:

+-------+---+                                                                   
|     _1| _2|
+-------+---+
|   math|150|
|science|175|
|english| 75|
+-------+---+

Here, the function lambda x, y: x + y is applied pairwise to the values of each key, so reduceByKey ends up summing the scores for each subject.
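
For comparison, the averaging use case from the groupByKey section can also be done with reduceByKey, avoiding shuffling every raw value. A common pattern (sketched here with the same scores RDD; sum_counts and avg_scores are illustrative names) is to reduce (sum, count) pairs and divide at the end:

# pair each score with a count of 1, then reduce the (sum, count) pairs per key
sum_counts = rdd.mapValues(lambda v: (v, 1)).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

avg_scores = sum_counts.mapValues(lambda p: p[0] / p[1])

print(avg_scores.collect())
# e.g. [('math', 75.0), ('science', 87.5), ('english', 75.0)]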

Internal Flow: GroupByKey vs ReduceByKey

A step-by-step look at how groupByKey and reduceByKey work internally.

GroupByKey

  • Partitioning: The input RDD is split into multiple partitions (by default, the number of partitions is determined from the available resources). Each partition contains a subset of the data.

    Partition 1: ("math", 80), ("science", 90)
    Partition 2: ("math", 70), ("science", 85), ("english", 75)

  • Shuffling: The data is shuffled across partitions based on the key of each element. This involves transferring data across the network so that all elements with the same key end up in the same partition.

    Task 1: ("math", 80), ("math", 70)
    Task 2: ("science", 90), ("science", 85), ("english", 75)

  • Grouping: Within each task, elements are grouped by their keys. The resulting groups are stored in memory as iterators.

    Task 1: ("math", (80, 70))
    Task 2: ("science", (90, 85)), ("english", (75))

  • Merging: The groups from each partition are merged into a single group for each unique key. This involves transferring data across the network so that all groups with the same key are combined.

    ("math", (80, 70))
    ("science", (90, 85))
    ("english", (75))

    NOTE: This example uses only 2 partitions. On a larger cluster, groups for the same key can still live on different machines, and they are merged in this step.

  • Output: The resulting RDD contains the grouped elements (key, iterable), with each key mapped to the sequence of its values.

If you compare this with the groupByKey code example above, you can see in the output below that each key (e.g., math) is mapped to a sequence of values (70, 80).

+-------+--------+
|     _1|      _2|
+-------+--------+
|   math|[70, 80]|
|science|[90, 85]|
|english|    [75]|
+-------+--------+
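
If you want to see the partitioning step for yourself, glom() collects each partition into a list. This is a minimal sketch that forces 2 partitions so the layout matches the walkthrough above (rdd2 is an illustrative name):

# force 2 partitions so the layout matches the walkthrough
rdd2 = sc.parallelize(scores, 2)

# glom() turns each partition into a list, revealing the physical layout
print(rdd2.glom().collect())
# e.g. [[('math', 80), ('science', 90)], [('math', 70), ('science', 85), ('english', 75)]]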

ReduceByKey

  • Partitioning: The input RDD is split into multiple partitions (by default, the number of partitions is determined from the available resources). Each partition contains a subset of the data.

    Partition 1: ("math", 80), ("science", 90)
    Partition 2: ("math", 70), ("science", 85), ("english", 75)

  • Local reduction: Before any data leaves a partition, the reduce function is applied to the values of any key that occurs more than once in that partition (the map-side combine). This produces a set of partial results. In this small example no key repeats within a partition, so the partial results equal the input; with more data per partition, this step is what shrinks the shuffle.

    Partition 1: ("math", 80), ("science", 90)
    Partition 2: ("math", 70), ("science", 85), ("english", 75)

  • Shuffling: The partial results are shuffled across partitions based on the key of each element, so that all partial results for the same key end up in the same task. Because values were already combined locally, less data is transferred over the network than with groupByKey.

    Task 1: ("math", 80), ("math", 70)
    Task 2: ("science", 90), ("science", 85), ("english", 75)

  • Merging: Within each task, the reduce function is applied again to merge the partial results into a single value for each unique key.

    Task 1: ("math", 150)
    Task 2: ("science", 175), ("english", 75)

    NOTE: This example uses only 2 partitions. On a larger cluster, partial results for the same key can come from many machines, and they are merged in this step.

  • Output: The resulting RDD contains the reduced value for each key.

If you compare this with the reduceByKey code example above, you can see in the output below that each key (e.g., math) maps to its reduced value (70 + 80 => 150).

+-------+---+                                                                   
|     _1| _2|
+-------+---+
|   math|150|
|science|175|
|english| 75|
+-------+---+
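
Under the hood, PySpark implements reduceByKey on top of combineByKey. As a rough illustration of the flow above (not the exact internal code), the same sum can be written with combineByKey, which makes the local-combine and cross-partition merge steps explicit:

# equivalent to rdd.reduceByKey(lambda x, y: x + y), with the steps spelled out
total_scores = rdd.combineByKey(
    lambda v: v,             # createCombiner: first value seen for a key in a partition
    lambda acc, v: acc + v,  # mergeValue: combine within a partition (local reduction)
    lambda a, b: a + b,      # mergeCombiners: merge partial results across partitions
)

print(total_scores.collect())
# e.g. [('math', 150), ('science', 175), ('english', 75)]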

Conclusion

In conclusion, groupByKey and reduceByKey are two transformations in Spark that operate on key-value pairs in an RDD. groupByKey returns a new RDD where each element is a key and the list of its values. reduceByKey applies a specified reduce function to the values for each key and returns a new RDD where each element is a key and its reduced value. Because reduceByKey combines values locally before the shuffle, it is usually the more efficient choice when all you need is one reduced value per key. Understanding these transformations is crucial for working with key-value pairs in Spark and for writing efficient Spark programs.

Good Luck with your Learning!!

Related Topics:

Difference between map and mapValues functions in Spark

How to read and write XML files using Spark?

Difference between map and flatMap in Spark

How to Access HBase from Spark
