Difference between map and mapValues functions in Spark

Spark is a distributed framework that uses in-memory computation to process large volumes of data much faster. One of Spark’s most widely used features is transforming RDDs (Resilient Distributed Datasets) or DataFrames using various functions. In this article, we are going to learn the difference between the map() and mapValues() functions.

In Spark, the map() function transforms each element of an RDD, while the mapValues() function transforms only the values of key-value pairs in an RDD, leaving the keys unchanged. map() can be used to transform any RDD, while mapValues() is specifically for key-value RDDs.
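As a quick preview, here is a minimal sketch of that contrast, assuming a PySpark shell where the SparkContext sc is already available:

pairs = sc.parallelize([("a", 1), ("b", 2)])

# with map() you receive the whole (key, value) tuple and must rebuild it yourself
print(pairs.map(lambda kv: (kv[0], kv[1] * 10)).collect())    # [('a', 10), ('b', 20)]

# with mapValues() you receive only the value; the keys are kept automatically
print(pairs.mapValues(lambda v: v * 10).collect())            # [('a', 10), ('b', 20)]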


What is map() Function in Spark?

The map() function transforms each element of an RDD and returns a new RDD with the transformed values. The function you pass to map() takes in an element and returns a new, transformed element.

NOTE: The output RDD of map() always contains the same number of elements as the input RDD.

When the RDD consists of key-value pairs, map() also gives you access to both the key and the value, so you can transform either or both as needed.
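For instance, here is a small sketch (again assuming a PySpark shell where sc is already defined) in which map() rewrites both the key and the value of each pair, something mapValues() cannot do:

ages = sc.parallelize([("alice", 30), ("bob", 25)])

# map() can rewrite the key and the value in a single pass
formatted = ages.map(lambda kv: (kv[0].title(), kv[1] + 1))

print(formatted.collect())
# [('Alice', 31), ('Bob', 26)]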

Example:

Let’s say you have an RDD of integers from 1 to 5 and you would like to create a new RDD with the squares of these integers. This can be easily achieved using the map() function:

numbers = sc.parallelize([1, 2, 3, 4, 5])

# map transformation to square each number
squares = numbers.map(lambda x: x*x)

# print the resulting RDD 
print(squares.collect())

Output:


[1, 4, 9, 16, 25]

In the above example, we created an RDD numbers containing the integers 1 to 5. We then applied the map() function to this RDD, which squared each integer and returned a new RDD containing the squares of the initial values.

What is mapValues() Function in Spark?

The mapValues() function is a transformation that applies a function specifically to the values of the key-value pairs in a pair RDD. The result is a new RDD with the same keys but transformed values.

Unlike map(), the function passed to mapValues() does not have access to the key; it only receives the value of each key-value pair. This is useful when you want to transform only the values of an RDD without changing the keys.

Example:

Let’s say you have an RDD of key-value pairs representing sales data for different products, where the product ID is the key and the product price is the value. You want to apply a 10% discount to every product price.

For this use case, you can simply apply the mapValues() function to alter the values, as shown in the sample code below:

salesData = sc.parallelize([("product1", 100.0), ("product2", 200.0), ("product3", 300.0)])

# mapValues transformation to apply a 10% discount to each price
discountedSalesData = salesData.mapValues(lambda sales: sales * 0.9)

# print the resulting RDD
print(discountedSalesData.collect())

Output:

[('product1', 90.0), ('product2', 180.0), ('product3', 270.0)]

In the above example, we created an RDD “salesData” containing the product ID and price for each product. We then applied the mapValues() function to this RDD, which multiplied each price by 0.9 to apply a 10% discount and returned a new RDD “discountedSalesData” with the discounted prices but the same product IDs.
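For comparison, the same result could be obtained with map(), but you would have to unpack and rebuild each tuple yourself. A sketch, reusing the salesData RDD from above:

# equivalent transformation with map(): the key must be carried through by hand
discountedWithMap = salesData.map(lambda kv: (kv[0], kv[1] * 0.9))

print(discountedWithMap.collect())
# [('product1', 90.0), ('product2', 180.0), ('product3', 270.0)]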

Differences between map() and mapValues() Functions in Spark

The map() and mapValues() functions in Spark have the following differences:

  1. map() applies a function to every element of an RDD, whereas mapValues() applies a function only to the values of the key-value pairs in a pair RDD.
  2. map() returns a new RDD with the transformed elements, whereas mapValues() returns a new RDD with the same keys but transformed values (a practical consequence of this is sketched after the list).
  3. map() works on any RDD (and, in the Scala and Java APIs, on Datasets and DataFrames; in PySpark a DataFrame must first be converted with df.rdd), whereas mapValues() is applicable only to RDDs of key-value pairs.
  4. With map() you can access both the key and the value of each element, whereas the function passed to mapValues() can only access the value of each key-value pair.
  5. map() can perform any transformation on an RDD, whereas mapValues() is useful only when you want to transform the values of an RDD without changing the keys.
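A practical consequence of point 2 is performance-related: because mapValues() guarantees that the keys are unchanged, Spark keeps the parent RDD’s partitioner, whereas map() discards it (the keys could have changed). A minimal sketch, again assuming a PySpark shell where sc is available:

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)]).partitionBy(2)

print(pairs.partitioner is not None)                                      # True
print(pairs.mapValues(lambda v: v + 1).partitioner is not None)           # True  - partitioner preserved
print(pairs.map(lambda kv: (kv[0], kv[1] + 1)).partitioner is not None)   # False - partitioner dropped

Keeping the partitioner lets later key-based operations such as reduceByKey() or joins avoid an extra shuffle.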

Real-world scenarios for map() and mapValues()

Example of map() function

Let’s consider a scenario where we have an RDD containing temperatures in Fahrenheit for different cities, and we want to convert them to Celsius. Here we can use the map() function to transform each temperature value in the RDD by applying the conversion formula.

# Create an RDD with temperature values in Fahrenheit
temperatures = sc.parallelize([("New York", 70), ("Los Angeles", 80), ("Chicago", 65)])

# Define a function to convert Fahrenheit to Celsius
def fahrenheit_to_celsius(temperature):
    return (temperature - 32) * 5 / 9.0

# Apply the map function to convert temperatures to Celsius, keeping the city name
celsius_temperatures = temperatures.map(lambda x: (x[0], fahrenheit_to_celsius(x[1])))

# Print the converted temperatures
print(celsius_temperatures.collect())

Output:

[('New York', 21.11111111111111), ('Los Angeles', 26.666666666666668), ('Chicago', 18.333333333333332)]

In the above example, we first created the RDD temperatures with temperature values in Fahrenheit for different cities. Then we defined a function fahrenheit_to_celsius() that converts a Fahrenheit temperature to Celsius. We applied the map() function to the temperatures RDD to convert each temperature from Fahrenheit to Celsius using fahrenheit_to_celsius(). Finally, we printed the converted temperatures.

Example of mapValues() function

Let’s build on the discount idea, this time with a condition on the value itself: apply a 10% discount only to purchase amounts greater than 50 and no more than 200. We have an RDD of (customer ID, purchase amount) pairs, and because the condition depends only on the purchase amount, mapValues() fits well and leaves the customer IDs untouched. We also keep a set of customer IDs eligible for a discount; as we will see, that per-customer rule needs map(), because mapValues() cannot see the key.

# RDD of (customer ID, purchase amount) pairs
purchases = sc.parallelize([(101, 50), (102, 75), (103, 100), (104, 200), (105, 150)])

# set of customer IDs eligible for a discount (mapValues() cannot see the key,
# so this set is only usable with map(); see the variant after this example)
eligible_customers = {102, 104, 105}

# use mapValues() to apply a 10% discount to amounts greater than 50 and at most 200
discounted_purchases = purchases.mapValues(lambda x: x * 0.9 if 50 < x <= 200 else x)

print(discounted_purchases.collect())

Output:

[(101, 50), (102, 67.5), (103, 90.0), (104, 180.0), (105, 135.0)]

In the above example, mapValues() applies the lambda function lambda x: x * 0.9 if 50 < x <= 200 else x to every purchase amount in the purchases RDD. The discount is driven purely by the amount: since the function passed to mapValues() never sees the key, the eligible_customers set cannot be consulted inside it.

The resulting RDD discounted_purchases has the same number of elements as purchases, with the qualifying purchase amounts reduced by 10%. Note that the customer IDs remain unchanged.
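If the discount really must depend on the customer ID, mapValues() is not enough, because the key is invisible to it. Here is a sketch of how map() could implement the per-customer rule, reusing the purchases RDD and the eligible_customers set from above:

# map() receives the whole (customer ID, amount) tuple, so the key can drive the decision
discounted_by_customer = purchases.map(
    lambda kv: (kv[0], kv[1] * 0.9) if kv[0] in eligible_customers else kv
)

print(discounted_by_customer.collect())
# [(101, 50), (102, 67.5), (103, 100), (104, 180.0), (105, 135.0)]

Compare the two results: customer 103 keeps the full amount of 100 here, because the decision is made on the key rather than on the purchase amount.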

Conclusion

In summary, the map() and mapValues() functions in Spark transform an RDD by applying a function to each element or to each value of a key-value pair, respectively. map() returns a new RDD with the transformed elements, whereas mapValues() returns a new RDD with the same keys but transformed values. map() works on any RDD (and, through the Scala and Java APIs, on Datasets and DataFrames), whereas mapValues() is applicable only to RDDs of key-value pairs.

Knowing the difference between map() and mapValues() functions is important when working with Spark, as it helps us to choose the right function for the task at hand and write efficient and optimized code.

Related Topic:
Difference between map and flatMap in Spark

Good Luck with your Learning!
