How to Access HBase from Spark

Spark is an open-source, distributed processing framework for big data; it helps perform analytics on data of any size. HBase, on the other hand, is an open-source NoSQL database that enables distributed storage for big data.

Combining HBase and Spark gives you a more scalable, high-performance data processing system. In this article, we’ll discuss how to access HBase from Spark, including the necessary steps to add the HBase Spark library, configure your Spark application, write data to and load data from HBase into a Spark DataFrame, and perform Spark operations on that data.


Step 1: Create an HBase table

We can use the HBase shell to create a table. The example below creates a table named students with a single column family c:

hbase shell
create 'students','c'

Example:

$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.4.6.7.2.16.0-287, refbe5f5f26e5e97d5f7de6d531b617e67e198edf, Mon Jan  9 13:20:07 UTC 2023
Took 0.0032 seconds                                                                                                                                 
hbase:001:0> create 'students','c'
Created table students
Took 1.9662 seconds                                                                                                                 
=> Hbase::Table - students
hbase:002:0> 

Step 2: Add the HBase Spark Libraries

In this example, we add all the dependent jars and the HBase configuration to the spark-submit/spark-shell command, as shown below. Note that --jars expects a comma-separated list of jars, so the shell glob is joined with commas using tr.

For Cloudera Distribution

spark-shell --master yarn --executor-cores 5 --deploy-mode client --jars $(echo /opt/cloudera/parcels/CDH/lib/hbase/*.jar | tr ' ' ',') --files /etc/hbase/conf/hbase-site.xml

For HDP Distribution

spark-shell --master yarn --executor-cores 5 --deploy-mode client \
--jars $(echo /usr/hdp/current/hbase-client/lib/hbase*.jar | tr ' ' ',') \
--files /etc/hbase/conf/hbase-site.xml

If you are using Apache Hadoop, you can add all the HBase jars and hbase-site.xml to the classpath, or use the --jars option as above.
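
For example, on a plain Apache Hadoop setup the command might look like the following (a sketch only: the jar path /usr/lib/hbase/lib is an assumption and depends on where HBase is installed on your cluster):

spark-shell --master yarn --executor-cores 5 --deploy-mode client \
--jars $(echo /usr/lib/hbase/lib/hbase*.jar | tr ' ' ',') \
--files /etc/hbase/conf/hbase-site.xml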

Step 3: Configure Your Spark Application

After adding the dependent libraries, you’ll need to configure your Spark application to connect to HBase. You can do this by creating an HBaseConfiguration that loads your hbase-site.xml and registering an HBaseContext with the SparkContext:

// create an HBase configuration that picks up hbase-site.xml and
// register an HBaseContext against the current SparkContext
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
new HBaseContext(spark.sparkContext, conf)

Step 4: Write data to the HBase table

Once you have configured your Spark application with all the HBase jars and configuration, we can proceed with writing data into the HBase table.

import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.Path

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
new HBaseContext(spark.sparkContext, conf)

case class Student(id: Long, name: String, age: Integer, mark: Float)

import spark.implicits._
val studentDS = Seq(
  Student(1, "student1", 13, 400.5f),
  Student(2, "student2", 12, 450.0f),
  Student(3, "student3", 14, 498.5f)
).toDS()

val columnMapping = "id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark"
val format = "org.apache.hadoop.hbase.spark"
val tableName = "students"

// write the data to hbase table
studentDS.write.format(format).option("hbase.columns.mapping",columnMapping).option("hbase.table", tableName).save()

In the above example, we created a sample Dataset of student records (id, name, age, and mark) and wrote it to the HBase table using the “studentDS.write.format” option. In the column mapping, :key binds the id field to the HBase row key, while entries such as c:name map a field to the name qualifier in the c column family.
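
If you want to verify the write independently of Spark, you can scan the table from the HBase shell:

hbase shell
scan 'students'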

Step 5: Read data from the HBase table

Once data has been written to the HBase table, we can read it back into a Spark DataFrame.

// read the data from hbase table
val df = spark.read.format(format).option("hbase.columns.mapping",columnMapping).option("hbase.table", tableName).load()
df.show(truncate=false)

Step 6: Perform Spark Operations on the Data

Once you have loaded the data into a DataFrame, you can perform any Spark operation on it, just like with any other DataFrame.

For example, you can filter the data, aggregate it, join it with other DataFrames, or write it to another data store.
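
For instance, the snippet below runs a couple of simple operations on the df DataFrame read in Step 5 (a small illustration; the filter threshold and the aggregation are arbitrary choices, and it relies on the spark.implicits._ import from Step 4):

import org.apache.spark.sql.functions.avg

// keep only students who scored more than 450 marks
df.filter($"mark" > 450.0f).show(truncate=false)

// compute the average mark per age group
df.groupBy($"age").agg(avg($"mark").alias("avg_mark")).show(truncate=false)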

Final code (which covers all the above steps)

$ spark-shell --master yarn --executor-cores 5 --deploy-mode client --jars $(echo /opt/cloudera/parcels/CDH/lib/hbase/*.jar | tr ' ' ',') --files /etc/hbase/conf/hbase-site.xml
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/30 17:04:16 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_1680147871532_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseContext

scala> import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HBaseConfiguration

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> val conf = HBaseConfiguration.create()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml

scala> conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

scala> new HBaseContext(spark.sparkContext, conf)
res1: org.apache.hadoop.hbase.spark.HBaseContext = org.apache.hadoop.hbase.spark.HBaseContext@2ba0b7cf

scala> case class Student(id:Long, name: String, age: Integer, mark: Float)
defined class Student

scala> import spark.implicits._
import spark.implicits._

scala> var studentDS = Seq(
     | Student(1, "student1", 13, 400.5f),
     | Student(2, "student2", 12, 450.0f),
     | Student(3, "student3", 14, 498.5f)
     | ).toDS()
studentDS: org.apache.spark.sql.Dataset[Student] = [id: bigint, name: string ... 2 more fields]

scala> val columnMapping = "id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark"
columnMapping: String = id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark

scala> val format = "org.apache.hadoop.hbase.spark"
format: String = org.apache.hadoop.hbase.spark

scala> val tableName = "students"
tableName: String = students

scala> // write the data to hbase table

scala> studentDS.write.format(format).option("hbase.columns.mapping",columnMapping).option("hbase.table", tableName).save()
id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark
23/03/30 17:05:36 WARN output.FileOutputCommitter: Output Path is null in setupJob()
23/03/30 17:05:46 WARN output.FileOutputCommitter: Output Path is null in commitJob()

scala> // read the data from hbase table

scala> val df = spark.read.format(format).option("hbase.columns.mapping",columnMapping).option("hbase.table", tableName).load()
id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark
df: org.apache.spark.sql.DataFrame = [mark: float, age: int ... 2 more fields]

scala> df.show(truncate=false)
+-----+---+--------+---+
|mark |age|name    |id |
+-----+---+--------+---+
|400.5|13 |student1|1  |
|450.0|12 |student2|2  |
|498.5|14 |student3|3  |
+-----+---+--------+---+

Conclusion

Accessing HBase from Spark is a powerful combination that helps us process and analyze big data with ease. By following the steps outlined in this article, you can quickly and easily move data to and from HBase using Spark, and then perform Spark operations on it.

Good Luck with your Learning !!
