How to Create Skewed Table with Apache Hive
– Use this for testing purposes and to experiment with data skew
– We create two tables: testing.tbl, the partitioned table that will hold the skewed data, and testing.auto, which generates roughly 30,000 random records by splitting a repeated string
Create table testing.tbl (col1 string) partitioned by (p1 int);

Create table testing.auto as select explode(split(repeat("Learn skew",30000),"w")) as col1;
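As a quick sanity check, the row-generation trick can be simulated outside Hive. Note the split delimiter must be a character that actually occurs in the repeated string (such as "w" in "Learn skew"); splitting on a character that never appears would yield a single row instead of ~30,000. A Python sketch mirroring the repeat/split/explode expression:

```python
# Simulate Hive's repeat() + split() + explode() row generator.
# Hive: split(repeat("Learn skew", 30000), "w") produces an array,
# and explode() turns each array element into a row.
repeated = "Learn skew" * 30000               # repeat("Learn skew", 30000)
rows = [r for r in repeated.split("w") if r]  # split on "w"; drop the trailing empty piece

print(len(rows))  # 30000 generated rows, one per repetition
```

Each "w" terminates one repetition of the string, so the generator produces one row per repetition, which is what testing.auto needs.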
– Insert the records with a fixed partition value in each statement, varying the LIMIT clause; this distributes the records unevenly across the partitions
insert into table testing.tbl partition (p1) SELECT col1, 1 as p1 from testing.auto limit 20;
insert into table testing.tbl partition (p1) SELECT col1, 2 as p1 from testing.auto limit 10;
insert into table testing.tbl partition (p1) SELECT col1, 3 as p1 from testing.auto limit 10;
insert into table testing.tbl partition (p1) SELECT col1, 4 as p1 from testing.auto limit 20;
insert into table testing.tbl partition (p1) SELECT col1, 5 as p1 from testing.auto limit 30;
insert into table testing.tbl partition (p1) SELECT col1, 6 as p1 from testing.auto limit 40;
insert into table testing.tbl partition (p1) SELECT col1, 7 as p1 from testing.auto;
insert into table testing.tbl partition (p1) SELECT col1, 8 as p1 from testing.auto limit 10;
insert into table testing.tbl partition (p1) SELECT col1, 9 as p1 from testing.auto limit 10;
insert into table testing.tbl partition (p1) SELECT col1, 10 as p1 from testing.auto limit 5;
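The per-partition row counts implied by the LIMIT clauses above can be tallied without running Hive. A small Python sketch (the unlimited insert for p1 = 7 is assumed to copy all ~30,000 rows of testing.auto):

```python
# Rows inserted per partition, taken from the LIMIT clause of each insert.
# p1=7 has no LIMIT, so it receives every row of testing.auto (~30000).
limits = {1: 20, 2: 10, 3: 10, 4: 20, 5: 30, 6: 40, 7: 30000, 8: 10, 9: 10, 10: 5}

total = sum(limits.values())
largest = max(limits, key=limits.get)
print(largest, limits[largest], total)  # partition 7 dominates the table
```

One partition ends up holding almost all of the data, which is exactly the shape of skew we want to reproduce.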
– Output for the skewed table. We can see the partitions are not evenly distributed: the partition with id "11" has 30,000 records while the partition with id "2" has only 10. Whichever task works on id "11" will therefore take far longer than the one working on id "2"
scala> spark.sql("select id,count(col1) from test.tbl group by id").show()
+---+-----------+
| id|count(col1)|
+---+-----------+
|  1|      10000|
|  6|         20|
|  4|         20|
|  7|         20|
| 11|      30000|
|  2|         10|
+---+-----------+
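One way to quantify the imbalance is to compare the largest partition against the mean partition size. A Python sketch using the counts from the sample output above:

```python
# Partition counts as reported by the group-by query above
counts = {1: 10000, 6: 20, 4: 20, 7: 20, 11: 30000, 2: 10}

mean = sum(counts.values()) / len(counts)       # average rows per partition
skew_factor = max(counts.values()) / mean       # how far the biggest partition deviates

print(round(skew_factor, 2))  # ~4.49: the biggest partition is ~4.5x the average
```

A skew factor well above 1 means the straggler task processing the largest partition will bound the job's runtime.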
Check here on how to find & handle data skew in Spark
Let us know if you have any other interesting ways to generate a skewed table 🙂