Handling Data Skew in Apache Spark Applications
What is data skew?
Let’s take a basic example of “construction workers”:
- Let’s assume a worker can build a house in 1 day.
- So, if we ask 2 workers to build 1 house each, they can build 2 houses in a day, and the total time to complete the work (2 houses) would be 1 day.
- But when we ask 1 worker to build 1 house and another worker to build 2 houses, the total time to complete the work would be 2 days, since the first worker sits idle after day 1 while the second keeps working. This is called skew.
In the above example, skew happened due to uneven distribution of work. Distributing the work evenly would get us better results.
In Spark,
aggregation happens based on the “Key/Partition” -> if the data is not distributed evenly across all partitions/keys, we end up seeing some workers (tasks) doing more work than others, and the stage finishes only as fast as its slowest task.
How to find data skew:
Create a table with unevenly distributed data across partitions. To reproduce this yourself, you can build your own skewed table, as in the sketch below.
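The following is a minimal sketch, not a definitive recipe: it assumes a spark-shell session (so spark is in scope), creates a hypothetical test database, and uses the same id/col1 column names as the query below. Keys 1 and 11 get 10,000 and 30,000 rows, while the other keys get only a handful.

import spark.implicits._

spark.sql("CREATE DATABASE IF NOT EXISTS test")

// Build a deliberately skewed dataset: a couple of "hot" keys dominate.
val skewed = Seq((1, 10000), (11, 30000), (2, 10), (4, 20), (6, 20), (7, 20))
  .flatMap { case (id, n) => Seq.fill(n)((id, s"value-$id")) }
  .toDF("id", "col1")

skewed.write.mode("overwrite").saveAsTable("test.tbl")

Once the table exists, a simple group-by count per key exposes the imbalance: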
scala> spark.sql("select id,count(col1) from test.tbl group by id").show()
+---+-----------+
| id|count(col1)|
+---+-----------+
|  1|      10000|
|  6|         20|
|  4|         20|
|  7|         20|
| 11|      30000|
|  2|         10|
+---+-----------+
The above output shows that some keys/partitions hold far more data than others. But in a typical real-world application, checking this directly for every table would be a tedious task.
Find skew using Spark event logs (Spark history server)
-> Download the Spark event log and load it into a local Spark history server (point spark.history.fs.logDirectory at the directory holding the downloaded log, then start the history server)
-> Go to the application’s stages in the Spark history server and open the stage where we are seeing the slowness
-> Check the “Summary Metrics”
[Screenshot: “Summary Metrics” table for the stage, showing per-task input record statistics]
-> In the above screenshot, let’s compare the min and max input record counts:
Min: 10 records
Max: 19316 records
This shows that some tasks ran with just 10 records while others ran with 19316. A large gap between the min and max input records per task is how we identify the skewness in the data.
-> We can also validate each individual task and its input record count, to see how the records get distributed across tasks, as shown below
[Screenshot: per-task table for the stage, showing the input records handled by each task]
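If you would rather check this programmatically than through the UI, one option is to count the records in each partition directly. This is only a sketch, reusing the hypothetical test.tbl table from above; the partition count of 6 is arbitrary.

import spark.implicits._

// Repartition by the grouping key so the partitions mirror the shuffle that
// an aggregation on `id` would produce, then count the records per partition.
val df = spark.table("test.tbl").repartition(6, $"id")

df.rdd
  .mapPartitionsWithIndex { case (idx, it) => Iterator((idx, it.size)) }
  .collect()
  .sortBy(_._1)
  .foreach { case (idx, n) => println(s"partition $idx -> $n records") }

Partitions holding the hot keys (1 and 11) will report counts orders of magnitude above the rest.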
Solution:
— Plan the partition column carefully up front, making sure the data is distributed evenly across its values
— We can also try various options like repartition (see the sketch below), but these come with some performance cost, since they add an extra shuffle
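The following is a minimal sketch of the repartition option, again using the hypothetical test.tbl; the extra shuffle is the performance cost mentioned above.

// Redistribute the skewed data into evenly sized partitions before doing
// further expensive work. repartition(n) with no column argument performs a
// full round-robin shuffle, so every partition ends up roughly the same size.
val df = spark.table("test.tbl")
val evened = df.repartition(8)

// Caution: repartition($"id") would keep all rows of a hot key in a single
// partition and preserve the skew; for a dominating key, use a plain
// repartition(n), or salt the key (append a random suffix) before aggregating.
evened.write.mode("overwrite").saveAsTable("test.tbl_evened")

Note that repartition only evens out the stages that follow it; an aggregation keyed on the skewed column will still shuffle the hot keys back together, which is why salting or choosing a better key is often the more durable fix.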
Let us know if you have any unique ways to find data skew, and your solutions for it.