How to Access HBase from Spark
Spark is an open-source, distributed processing framework used in big data; it helps perform analytics on data of any size. HBase, on the other hand, is an open-source NoSQL database that enables distributed storage for big data.
By combining HBase and Spark, you get a more scalable, high-performance data processing system. In this article, we'll discuss how to access HBase from Spark, including the steps needed to add the HBase Spark library, configure your Spark application, load data from HBase into a Spark DataFrame, and perform Spark operations on that data.

Step 1: Create an HBase table
We can use the HBase shell to create the table; the commands are shown below.
hbase shell
create 'students','c'
Example:
$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.4.6.7.2.16.0-287, refbe5f5f26e5e97d5f7de6d531b617e67e198edf, Mon Jan 9 13:20:07 UTC 2023
Took 0.0032 seconds
hbase:001:0> create 'students','c'
Created table students
Took 1.9662 seconds
=> Hbase::Table - students
hbase:002:0>
Step 2: Add the HBase Spark Libraries
In this example, we add all the dependent jars and configuration files to the spark-submit/spark-shell command, as shown below.
For Cloudera Distribution
spark-shell --master yarn --executor-cores 5 --deploy-mode client --jars /opt/cloudera/parcels/CDH/lib/hbase/*.jar --files /etc/hbase/conf/hbase-site.xml
For HDP Distribution
spark-shell --master yarn --executor-cores 5 --deploy-mode client \
--jars /usr/hdp/current/hbase-client/lib/hbase*.jar \
--files /etc/hbase/conf/hbase-site.xml
If you are using vanilla Apache Hadoop, you can add all the HBase jars and hbase-site.xml to the classpath, or use the --jars and --files options as above.
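For example, on a plain Apache Hadoop/HBase installation the command might look like the following; $HBASE_HOME is assumed to point at your HBase installation directory, so adjust the paths for your environment:
spark-shell --master yarn --executor-cores 5 --deploy-mode client \
--jars "$HBASE_HOME"/lib/hbase*.jar \
--files "$HBASE_HOME"/conf/hbase-site.xml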
Step 3: Configure Your Spark Application
After adding the dependent libraries, you'll need to configure your Spark application to connect to HBase. The HBase Spark connector does this through an HBaseConfiguration (which picks up hbase-site.xml) registered with an HBaseContext:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
new HBaseContext(spark.sparkContext, conf)
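If hbase-site.xml is not readily available to the driver, the key connection settings can also be set programmatically on the HBaseConfiguration. The hostnames and port below are placeholders; replace them with your own cluster's values:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

val conf = HBaseConfiguration.create()
// Placeholder values: use your cluster's ZooKeeper quorum and client port.
conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com")
conf.set("hbase.zookeeper.property.clientPort", "2181")
new HBaseContext(spark.sparkContext, conf)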
Step 4: Write data to the HBase table
Once you have configured your Spark application with the HBase jars and configuration, we can proceed with loading data into the HBase table.
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.Path

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
new HBaseContext(spark.sparkContext, conf)

case class Student(id: Long, name: String, age: Integer, mark: Float)

import spark.implicits._
val studentDS = Seq(
  Student(1, "student1", 13, 400.5f),
  Student(2, "student2", 12, 450.0f),
  Student(3, "student3", 14, 498.5f)
).toDS()

val columnMapping = "id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark"
val format = "org.apache.hadoop.hbase.spark"
val tableName = "students"

// write the data to the hbase table
studentDS.write.format(format)
  .option("hbase.columns.mapping", columnMapping)
  .option("hbase.table", tableName)
  .save()
In the above example, we created a sample dataset of student information (id, name, age, and mark). Once it is created, we write the dataset to the HBase table using studentDS.write.format with the hbase.columns.mapping and hbase.table options.
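In the hbase.columns.mapping string, each entry takes the form "field type family:qualifier", with ":key" marking the row key, as in the example above. As a purely hypothetical sketch, if marks were stored in a separate column family d (which our students table does not have), only that part of the mapping would change:
// Hypothetical mapping: "d" is an extra column family used for illustration only;
// the 'students' table created above only has the column family "c".
val hypotheticalMapping = "id Long :key, name STRING c:name, age Integer c:age, mark FLOAT d:mark"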
Step 5: Read data from the HBase table
Once data has been written to the HBase table, we can read it back into a Spark DataFrame.
// read the data from the hbase table
val df = spark.read.format(format)
  .option("hbase.columns.mapping", columnMapping)
  .option("hbase.table", tableName)
  .load()

df.show(truncate = false)
Step 6: Perform Spark Operations on the Data
Once you have loaded the data into a DataFrame, you can perform any Spark operation on it, just as you would on any other DataFrame.
For example, you can filter the data, aggregate it, join it with other DataFrames, or write it to another data store.
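Here is a minimal sketch using the df DataFrame loaded above; the 450 threshold and the average-by-age aggregation are arbitrary examples:
import org.apache.spark.sql.functions.avg
import spark.implicits._

// Filter: keep students whose mark is at least 450 (arbitrary threshold).
val topStudents = df.filter($"mark" >= 450.0f)
topStudents.show(truncate = false)

// Aggregate: average mark per age group.
df.groupBy("age").agg(avg("mark").alias("avg_mark")).show(truncate = false)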
Final code (which covers all the above steps)
$ spark-shell --master yarn --executor-cores 5 --deploy-mode client --jars /opt/cloudera/parcels/CDH/lib/hbase/*.jar --files /etc/hbase/conf/hbase-site.xml
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/30 17:04:16 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_1680147871532_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseContext

scala> import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HBaseConfiguration

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> val conf = HBaseConfiguration.create()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml

scala> conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

scala> new HBaseContext(spark.sparkContext, conf)
res1: org.apache.hadoop.hbase.spark.HBaseContext = org.apache.hadoop.hbase.spark.HBaseContext@2ba0b7cf

scala> case class Student(id:Long, name: String, age: Integer, mark: Float)
defined class Student

scala> import spark.implicits._
import spark.implicits._

scala> var studentDS = Seq(
     |   Student(1, "student1", 13, 400.5f),
     |   Student(2, "student2", 12, 450.0f),
     |   Student(3, "student3", 14, 498.5f)
     | ).toDS()
studentDS: org.apache.spark.sql.Dataset[Student] = [id: bigint, name: string ... 2 more fields]

scala> val columnMapping = "id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark"
columnMapping: String = id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark

scala> val format = "org.apache.hadoop.hbase.spark"
format: String = org.apache.hadoop.hbase.spark

scala> val tableName = "students"
tableName: String = students

scala> // write the data to hbase table

scala> studentDS.write.format(format).option("hbase.columns.mapping",columnMapping).option("hbase.table", tableName).save()
id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark
23/03/30 17:05:36 WARN output.FileOutputCommitter: Output Path is null in setupJob()
23/03/30 17:05:46 WARN output.FileOutputCommitter: Output Path is null in commitJob()

scala> // read the data from hbase table

scala> val df = spark.read.format(format).option("hbase.columns.mapping",columnMapping).option("hbase.table", tableName).load()
id Long :key, name STRING c:name, age Integer c:age, mark FLOAT c:mark
df: org.apache.spark.sql.DataFrame = [mark: float, age: int ... 2 more fields]

scala> df.show(truncate=false)
+-----+---+--------+---+
|mark |age|name    |id |
+-----+---+--------+---+
|400.5|13 |student1|1  |
|450.0|12 |student2|2  |
|498.5|14 |student3|3  |
+-----+---+--------+---+
Conclusion
Accessing HBase from Spark is a powerful combination that helps us process and analyze big data with ease. By following the steps outlined in this article, you can quickly and easily move data to and from HBase using Spark, and then perform Spark operations on it.
Good luck with your learning!