How to Create a Hive table with Ozone Filesystem


Apache Hadoop Ozone is a distributed object store for Hadoop, designed to scale to trillions of objects while providing a unified namespace and storage for big data applications. In this article, we will walk through how to create a Hive table on Ozone, step by step.


Prerequisites

Before we begin, ensure you have the following:

  • Apache Hadoop Ozone installed and configured.
  • Apache Hive installed and configured.
  • A basic understanding of Hive tables.

Steps to create a Hive table with Ozone

  1. First, create an Ozone volume and a bucket under it to store the Hive table data. You can use the following commands:
# ozone sh volume create /volume

21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: volume, with test_user as owner.

# ozone sh bucket create /volume/hive-bucket

23/03/02 17:24:53 INFO rpc.RpcClient: Creating Bucket: volume/hive-bucket, with the Bucket Layout null, test_user as owner, Versioning false, Storage Type set to DISK and Encryption set to false
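To confirm that the bucket was created, the Ozone shell can list and describe it (a quick sanity check using the standard Ozone CLI):

```shell
# List buckets under the volume and show details for the new bucket
ozone sh bucket list /volume
ozone sh bucket info /volume/hive-bucket
```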

  2. Next, create a new Hive database. You can use the following command to create the database:
CREATE DATABASE ozone_db;
  3. Once the database is created, switch to it using the following command:
USE ozone_db;

  4. Now, create a new Hive table with its location on the Ozone filesystem (the o3fs URI takes the form bucket.volume.<om-service-id>; the table name needs backticks because it contains a hyphen). You can use the following command to create the table:
CREATE EXTERNAL TABLE `learn-share`(
   `id` int,
   `name` string,
   `age` int)
STORED AS parquet
LOCATION 'o3fs://hive-bucket.volume.ozone/';

INFO  : Completed executing command(queryId=hive_20230302174616_7e82510e-8ff1-4388-a443-e3ec6f453b9d); Time taken: 1.469 seconds
No rows affected (2.192 seconds)
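To verify that the table's location points at Ozone, you can describe it from Beeline (a sketch; `<hiveserver2-host>` is a placeholder to replace with your HiveServer2 address):

```shell
# Hypothetical JDBC URL; adjust host/port and auth to your cluster
beeline -u "jdbc:hive2://<hiveserver2-host>:10000" \
  -e 'DESCRIBE FORMATTED ozone_db.`learn-share`;'
```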

  5. Finally, load data into the table using the INSERT INTO statement. You can use the following commands to insert data (again, the backticks are required because of the hyphen in the table name):
INSERT INTO TABLE ozone_db.`learn-share` VALUES (1, "learn", 10);
INSERT INTO TABLE ozone_db.`learn-share` VALUES (2, "share", 20);

Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 11.99 s    
1 row affected (20.085 seconds)

Query the table to verify the inserted data:

select * from `learn-share`;
INFO  : Compiling command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214): select * from `learn-share`
INFO  : No Stats for default@learn-share, Columns: name, id, age
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:learn-share.id, type:int, comment:null), FieldSchema(name:learn-share.name, type:string, comment:null), FieldSchema(name:learn-share.age, type:int, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214); Time taken: 7.314 seconds
INFO  : Executing command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214): select * from `learn-share`
INFO  : Completed executing command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214); Time taken: 0.088 seconds

+-----------------+-------------------+------------------+
| learn-share.id  | learn-share.name  | learn-share.age  |
+-----------------+-------------------+------------------+
| 1               | learn             | 10               |
+-----------------+-------------------+------------------+
1 row selected (11.231 seconds)
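You can also confirm that the table's data files landed in the Ozone bucket by listing its keys (a quick check with the Ozone shell):

```shell
# List the keys (data files) now stored under the bucket
ozone sh key list /volume/hive-bucket
```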

Access the Hive table (on the Ozone filesystem) from Spark

Once the table location points to the Ozone filesystem, Spark and Hive forward requests to Ozone by default; no additional configuration is needed.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 2.7.5 (default, Oct 14 2020 14:45:30)
SparkSession available as 'spark'.
>>> spark.sql("select * from `learn-share`").show()
Hive Session ID = 6d94b2ca-5882-4f44-95fd-eddfa550c69a
23/03/02 17:58:00 WARN metadata.Hive: Failed to register udf functions in external driver.
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|learn| 10|
+---+-----+---+


Insert into the Hive table using Spark

>>> spark.sql("""INSERT INTO TABLE `learn-share` VALUES (2, "share", 20)""")

23/03/02 18:00:54 WARN ozone.BasicOzoneFileSystem: delete: Path does not exist: o3fs://hive-bucket.volume.ozone1/.spark-staging-ec0a8462-d4d4-4598-b018-d675fdfc24d8


>>> spark.sql("select * from `learn-share`").show()
+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|share| 20|
|  1|learn| 10|
+---+-----+---+

Note: If you face the following error while connecting to the Ozone filesystem:

java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found

you can explicitly add the jar file when launching pyspark or spark-submit, as shown below:

pyspark --jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar
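The same jar can be passed to spark-submit for batch jobs (a sketch; `my_job.py` is a hypothetical script name, and the parcel path should be adapted to your deployment):

```shell
# Pass the Ozone filesystem jar to a batch job as well
spark-submit \
  --jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar \
  my_job.py
```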

Check here to learn how to run a Spark job with the Ozone filesystem.


Conclusion

Creating a Hive table with Ozone is a straightforward process. First, create an Ozone volume and bucket to store the table data, then create a Hive database and switch to it. Finally, create a new Hive table whose location points to the Ozone filesystem and load data into it. By following these steps, you can easily leverage the scalability and unified namespace provided by Hadoop Ozone in your Hive tables.

For more information about the various Ozone command-line tools and the Ozone shell, check here

Good Luck with your Learning !!
