How to Connect to Apache Hive from Python

First of all, let’s start with some background. Apache Hive is a data warehousing tool that lets you query and analyze large datasets stored in the Hadoop Distributed File System (HDFS) using SQL-like syntax. Python, meanwhile, is a popular programming language widely used in the data science community for data analysis, machine learning, and other data-related tasks.

Connecting to Hive from Python is a straightforward process that involves the use of a Python package called PyHive. PyHive is a Python client for HiveServer2, which provides a way to access Hive databases and execute queries.

Introduction

Apache Hive is built on top of Hadoop and provides a SQL-like interface to query and analyze data. Python is a popular programming language for data analysis and machine learning. In this article, we will discuss how to connect to Hive from Python using the PyHive library.

Prerequisites

Before proceeding with this article, make sure you have the following installed on your system:

  1. Python (version 3 or higher)
  2. PyHive library (version 0.6.3 or higher) and its dependent modules
  3. Hive – the Hive Metastore and HiveServer2 services must be running in your cluster

Install Python 3

I am installing Python 3 on a Red Hat server; the command below works there, and the syntax will vary based on your OS flavor.

sudo yum install python3

You can also download the latest Python version from Python.org.

Install PyHive Library

We are going to install PyHive and its dependent modules:

pip install sasl
pip install thrift-sasl
pip install thrift
pip install PyHive
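
Once the packages are installed, a quick sanity check from Python confirms everything imported cleanly. This is a minimal sketch; any ImportError here points to the module that failed to install:

# Sanity check: each import corresponds to one of the packages above.
from pyhive import hive
import thrift
import thrift_sasl
import sasl

print("PyHive and its dependencies imported successfully")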

NOTE: You might face an issue while installing the “sasl” module, as shown below:

$ pip install sasl
Collecting sasl
  Downloading https://files.pythonhosted.org/packages/df/ae/d8dda9ef1636f548935c271910d3b35afbf1782df582fda88a13ea48de53/sasl-0.3.1.tar.gz (44kB)
    100% |████████████████████████████████| 51kB 5.4MB/s 
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/site-packages (from sasl)
Installing collected packages: sasl
  Running setup.py install for sasl ... error
    D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Isasl -I/usr/include/python2.7 -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-2.7/sasl/saslwrapper.o
    gcc: error trying to exec 'cc1plus': execvp: No such file or directory
    error: command 'gcc' failed with exit status 1

To get rid of this error, we need to install the required development packages on the Linux host:

yum install cyrus-sasl-devel

or

 yum install python-devel openldap-devel

For me, it worked after installing the python-devel package. (Note that on Debian/Ubuntu the SASL development package is libsasl2-dev, installed with apt-get rather than yum.)

Connecting to Hive from Python

To connect to Hive from Python, we use the PyHive library. PyHive provides Python bindings for Apache Hive and lets us interact with Hive through Python’s DB-API interface. PyHive supports both Python 2 and Python 3.

Step 1: Import required libraries

Before we start connecting to Hive, we need to import the required libraries in our Python script. We will use the following libraries:

from pyhive import hive 

Step 2: Establish a connection to Hive

To establish a connection to Hive, we need to create a connection object using the connection parameters such as hostname, port number, database name, and user credentials. The following code snippet demonstrates how to create a connection object:

conn = hive.Connection(host="hostname", port=10000, database="mydb", username="myuser")

In the above code, we are creating a connection to the HiveServer2 service on <hostname> with port number 10000 (the default HiveServer2 port). We are connecting to the database named mydb with the username myuser. You can replace these parameters with your own connection values.
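
If your HiveServer2 is secured, the Connection constructor also accepts an auth argument (NONE, NOSASL, KERBEROS, LDAP, or CUSTOM). Here is a minimal sketch assuming LDAP authentication; the host, database, user, and password values are placeholders:

from pyhive import hive

# Hypothetical LDAP-secured connection; for KERBEROS you would
# instead pass kerberos_service_name and omit the password.
conn = hive.Connection(
    host="hostname",
    port=10000,
    database="mydb",
    username="myuser",
    password="mypassword",  # placeholder; required for LDAP/CUSTOM auth
    auth="LDAP",
)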

Step 3: Execute HiveQL queries

Once we have established a connection to Hive, we can execute HiveQL queries using the connection object. We will use the cursor object to execute queries. The cursor object is created by calling the cursor() method on the connection object. The following code snippet demonstrates how to execute a HiveQL query:

cursor = conn.cursor()
cursor.execute("SELECT * FROM mytable")

In the above code, we are creating a cursor object and executing a SELECT query on a table named mytable.
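
If the query needs runtime values, it is safer to pass them as parameters than to build the SQL string by hand. PyHive uses the DB-API “pyformat” parameter style; a minimal sketch, where the column name and threshold are placeholders for illustration:

# PyHive escapes the bound values for us, avoiding manual string
# concatenation in the query text.
cursor.execute(
    "SELECT * FROM mytable WHERE total_emp > %(threshold)s",
    {"threshold": 100000}
)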

Step 4: Fetch query results

After executing the query, we need to fetch the results using the fetchall() or fetchone() method of the cursor object. The fetchall() method returns all the rows of the result set as a list of tuples, whereas the fetchone() method returns only one row of the result set as a tuple. The following code snippet demonstrates how to fetch the results of a query:

rows = cursor.fetchall()
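
For large result sets, an alternative sketch is to stream rows one at a time with fetchone() instead of holding the entire result in memory:

# fetchone() returns None once the result set is exhausted.
row = cursor.fetchone()
while row is not None:
    print(row)
    row = cursor.fetchone()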

Step 5: Display the results

Once we have fetched the results, we can display them by iterating over the rows and printing each one:

for row in rows:
    print(row)

Final Working Code

# cat python_fetchall.py 

from pyhive import hive 

# establish connection to Hive server 

connection = hive.connect(host='hostname', port=10000, username='systest') 

# create cursor object 

cursor = connection.cursor() 

# execute SELECT statement to retrieve data from a table 

cursor.execute("SELECT * FROM sample_08") 

# fetch all rows at once using fetchall() 

rows = cursor.fetchall() 
for row in rows: 
    print(row) 

# close cursor and connection 

cursor.close() 
connection.close()

Output:

# python python_fetchall.py
(u'00-0000', u'All Occupations', 135185230, 42270)
(u'11-0000', u'Management occupations', 6152650, 100310)
(u'11-1011', u'Chief executives', 301930, 160440)
(u'11-1021', u'General and operations managers', 1697690, 107970)
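
As a side note, if you have pandas installed, the same PyHive connection can feed query results straight into a DataFrame. A minimal sketch, assuming pandas is available and reusing the sample_08 table from above:

import pandas as pd
from pyhive import hive

# pandas.read_sql accepts any DB-API connection, including PyHive's.
connection = hive.connect(host='hostname', port=10000, username='systest')
df = pd.read_sql("SELECT * FROM sample_08", connection)
print(df.head())
connection.close()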

Conclusion

In conclusion, connecting Apache Hive with Python using PyHive is a great way to leverage the strengths of both tools. PyHive provides a Python DB-API interface to Hive, allowing you to execute SQL-like queries from Python and receive the results as native Python objects, which makes downstream analysis much easier.

Overall, this can be a powerful combination for any data scientist or analyst who works with large datasets stored in Hadoop.

Good luck with your learning!
