How to Create a Hive table with Ozone Filesystem

Create a Hive table with Ozone Filesystem

Apache Hadoop Ozone is a distributed object store for Hadoop designed for scaling to trillions of objects and providing a unified namespace and storage for big data applications. In this article, we will walk through how to create a Hive table with Ozone step by step.

Prerequisites

Before we begin, ensure you have the following:

  • Apache Hadoop Ozone installed and configured.
  • Apache Hive installed and configured.
  • A basic understanding of Hive tables.

Steps to create a Hive table with Ozone

  1. First, create a new Ozone bucket to store the Hive table data. You can use the following command to create the bucket:
# ozone sh volume create /volume

21/08/25 06:23:27 INFO rpc.RpcClient: Creating Volume: vol1, with test_user as owner.

#  ozone sh bucket create /volume/hive-bucket

23/03/02 17:24:53 INFO rpc.RpcClient: Creating Bucket: volume/hive-bucket, with the Bucket Layout null, test_user as owner, Versioning false, Storage Type set to DISK and Encryption set to false 

  1. Next, create a new Hive database. You can use the following command to create the database:
CREATE DATABASE IF NOT EXISTS ozone_db;
  1. Once the database is created, switch to it using the following command:
USE  ozone_db;

  1. Now, create a new Hive table by specifying the storage handler as Ozone. You can use the following command to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS `learn-share`(  
   `id` int,                                     
   `name` string,                                   
   `age` int)
STORED AS parquet 
LOCATION 'o3fs://hive-bucket.volume.ozone/;                 

INFO  : Completed executing command(queryId=hive_20230302174616_7e82510e-8ff1-4388-a443-e3ec6f453b9d); Time taken: 1.469 seconds
INFO  : OK
No rows affected (2.192 seconds)

  1. Finally, load data into the table using the INSERT INTO statement. You can use the following command to insert data:
INSERT INTO TABLE ozone_db.learn-share VALUES (1, "learn", 10)
INSERT INTO TABLE ozone_db.learn-share VALUES (2, "share", 20)

----------------------------------------------------------------------------------------------
VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
-----------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
-----------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 11.99 s    
----------------------------------------------------------------------------------------------
1 row affected (20.085 seconds)

select * from `learn-share`;
INFO  : Compiling command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214): select * from `learn-share`
INFO  : No Stats for default@learn-share, Columns: name, id, age
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:learn-share.id, type:bigint, comment:null), FieldSchema(name:learn-share.name, type:string, comment:null), FieldSchema(name:learn-share.age, type:smallint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214); Time taken: 7.314 seconds
INFO  : Executing command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214): select * from `learn-share`
INFO  : Completed executing command(queryId=hive_20230302175238_285a90c7-e702-4af1-a482-791ea5d1a214); Time taken: 0.088 seconds
INFO  : OK

+-----------------+-------------------+------------------+
| learn-share.id  | learn-share.name  | learn-share.age  |
+-----------------+-------------------+------------------+
| 1               | learn             | 10               |
+-----------------+-------------------+------------------+
1 row selected (11.231 seconds)

Access Hive table(with Ozone filesystem) from Spark 

Once you have updated the Table location to the Ozone filesystem, then spark and the hive will forward the request to the Ozone FS by default and we didn’t need to do any other configuration

Welcome to
____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/
Using Python version 2.7.5 (default, Oct 14 2020 14:45:30)
SparkSession available as 'spark'.
>>> spark.sql("select * from `learn-share`").show()
Hive Session ID = 6d94b2ca-5882-4f44-95fd-eddfa550c69a
23/03/02 17:58:00 WARN metadata.Hive: Failed to register udf functions in external driver.
+---+-----+---+                                                                 
| id| name|age|
+---+-----+---+
|  1|learn| 10|
+---+-----+---+
>>>

    

Insert into Hive table using spark

>>> spark.sql("""INSERT INTO TABLE `learn-share` VALUES (2, "share", 20)""")

23/03/02 18:00:54 WARN ozone.BasicOzoneFileSystem: delete: Path does not exist: o3fs://hive-bucket.volume.ozone1/.spark-staging-ec0a8462-d4d4-4598-b018-d675fdfc24d8

DataFrame[]

>>> spark.sql("select * from `learn-share`").show()
+---+-----+---+                                                                 
| id| name|age|
+---+-----+---+
|  2|share| 20|
|  1|learn| 10|
+---+-----+---+

Note: If you are facing the below issue while connecting the Ozone filesystem

java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ozone.OzoneFileSystem not found

We can explicitly add the jar file While triggering the pyspark or spark command as below

pyspark --jars /opt/cloudera/parcels/CDH/lib/hadoop-ozone/hadoop-ozone-filesystem-hadoop3-*.jar

Check here to learn, How to run spark job with Ozone Filesystem

Conclusion

Creating a Hive table with Ozone is a straightforward process. First, create an Ozone bucket to store the table data, then create a Hive database and switch to it. Finally, create a new Hive table with the Ozone storage handler and load data into it. By following these steps, you can easily leverage the scalability and unified namespace provided by Hadoop Ozone in your Hive tables.

For more information about the various Ozone command-line tools and the Ozone shell, check here

Good Luck with your Learning !!

Similar Posts