First of all, let’s start with some background information. Apache Hive is a data warehousing tool built on top of Hadoop that allows you to query and analyze large datasets stored in the Hadoop Distributed File System (HDFS) using SQL-like syntax. Python, in turn, is a popular programming language widely used in the data science community for data analysis, machine learning, and other data-related tasks.
Connecting to Hive from Python is a straightforward process thanks to PyHive, a Python client for HiveServer2 that provides a way to access Hive databases and execute queries. In this article, we will discuss how to connect to Hive from Python using the PyHive library.
Before proceeding with this article, make sure you have the following installed on your system:
- Python (version 3 or higher)
- PyHive library (version 0.6.3 or higher) and its dependent modules
- Hive, with the Hive Metastore and HiveServer2 services running in your cluster
Install Python 3
I am installing Python 3 on a Red Hat server; the command below will vary depending on your OS flavor.
sudo yum install python3
You can also download the latest Python release from Python.org.
Install PyHive Library
We are going to install PyHive and its dependent modules:
pip install sasl
pip install thrift-sasl
pip install thrift
pip install PyHive
NOTE: You might face an issue while installing the “sasl” module, as shown below:
$ pip install sasl
Collecting sasl
  Downloading https://files.pythonhosted.org/packages/df/ae/d8dda9ef1636f548935c271910d3b35afbf1782df582fda88a13ea48de53/sasl-0.3.1.tar.gz (44kB)
    100% |████████████████████████████████| 51kB 5.4MB/s
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/site-packages (from sasl)
Installing collected packages: sasl
  Running setup.py install for sasl ... error
    D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Isasl -I/usr/include/python2.7 -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-2.7/sasl/saslwrapper.o
    gcc: error trying to exec 'cc1plus': execvp: No such file or directory
    error: command 'gcc' failed with exit status 1
To get rid of this error, we need to install the dependent packages on the Linux box:
yum install cyrus-sasl-devel    # on Debian/Ubuntu: apt-get install libsasl2-dev
yum install python-devel openldap-devel
For me, it worked after installing the python-devel package. If the cc1plus error persists, the C++ compiler itself may be missing (yum install gcc-c++).
Connecting to Hive from Python
To connect to Hive from Python, we use the PyHive library. PyHive provides Python bindings for Apache Hive and lets us interact with Hive through Python’s DB-API interface. PyHive supports both Python 2 and Python 3.
Step 1: Import required libraries
Before we start connecting to Hive, we need to import the required library in our Python script:
from pyhive import hive
Step 2: Establish a connection to Hive
To establish a connection to Hive, we need to create a connection object using the connection parameters such as hostname, port number, database name, and user credentials. The following code snippet demonstrates how to create a connection object:
conn = hive.Connection(host="hostname", port=10000, database="mydb", username="myuser")
In the above code, we create a connection object to the HiveServer2 service on <hostname> using port 10000 (the default HiveServer2 port). We connect to the database named mydb with the username myuser. Replace these parameters with your own connection details.
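Depending on your cluster’s security setup, hive.Connection also accepts authentication options (for example, auth="LDAP" together with a password). The sketch below is a hypothetical helper, not part of PyHive, that collects connection keyword arguments; the actual connection call is commented out since it needs a reachable HiveServer2, and the LDAP/password pairing rule reflects my understanding of PyHive’s behavior:

```python
def hive_conn_kwargs(host, port=10000, database="default",
                     username=None, auth=None, password=None):
    """Collect keyword arguments for pyhive.hive.Connection.

    auth='LDAP' pairs with a password; with no auth argument, plain
    SASL with just the username is used (assumption based on PyHive).
    """
    kwargs = {"host": host, "port": port, "database": database}
    if username:
        kwargs["username"] = username
    if auth:
        kwargs["auth"] = auth
    if password:
        if auth != "LDAP":
            raise ValueError("password is only used with auth='LDAP'")
        kwargs["password"] = password
    return kwargs

# With a reachable HiveServer2 you would then open the connection:
#   from pyhive import hive
#   conn = hive.Connection(**hive_conn_kwargs("hostname", database="mydb",
#                                             username="myuser"))
print(hive_conn_kwargs("hostname", database="mydb", username="myuser"))
```

Keeping the parameters in one place like this makes it easy to switch authentication modes per environment.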
Step 3: Execute HiveQL queries
Once we have established a connection to Hive, we can execute HiveQL queries using the connection object. We will use the cursor object to execute queries. The cursor object is created by calling the cursor() method on the connection object. The following code snippet demonstrates how to execute a HiveQL query:
cursor = conn.cursor()
cursor.execute("SELECT * FROM mytable")
In the above code, we create a cursor object and execute a SELECT query on a table named mytable.
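Rather than concatenating values into the SQL string, you can bind them through the DB-API; PyHive’s paramstyle is pyformat (named %(...)s placeholders). A small sketch, with the actual execute call commented out because it needs a live server (PyHive performs its own escaping; the final print only approximates the substitution):

```python
# PyHive uses the DB-API 'pyformat' paramstyle: named %(...)s placeholders.
query = "SELECT * FROM mytable WHERE total_emp > %(min_emp)s"
params = {"min_emp": 100000}

# Against a live HiveServer2 you would run:
#   cursor.execute(query, params)

# Roughly what gets substituted (PyHive also quotes string values):
print(query % {k: str(v) for k, v in params.items()})
```

Binding parameters this way avoids quoting mistakes and keeps queries reusable.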
Step 4: Fetch query results
After executing the query, we need to fetch the results using the fetchall() or fetchone() method of the cursor object. The fetchall() method returns all the rows of the result set as a list of tuples, whereas the fetchone() method returns only one row of the result set as a tuple. The following code snippet demonstrates how to fetch the results of a query:
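For large result sets, fetchall() loads every row into memory at once; the DB-API also provides fetchmany(size), which returns rows in batches. The loop below is a sketch of that batching pattern using a stand-in cursor class (FakeCursor is purely illustrative), so it runs without a live HiveServer2:

```python
class FakeCursor:
    """Stand-in mimicking the DB-API fetchmany() contract, so the
    batching loop below is runnable without a live HiveServer2."""
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size=1):
        # Return up to `size` rows, consuming them from the result set
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

cursor = FakeCursor([
    ("00-0000", "All Occupations", 135185230, 42270),
    ("11-0000", "Management occupations", 6152650, 100310),
    ("11-1011", "Chief executives", 301930, 160440),
])

fetched = []
while True:
    batch = cursor.fetchmany(2)   # with PyHive, e.g. cursor.fetchmany(1000)
    if not batch:
        break
    fetched.extend(batch)
    for row in batch:
        print(row)
```

With a real PyHive cursor the loop is identical; only the batch size would typically be much larger.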
rows = cursor.fetchall()
Step 5: Display the results
Once we have fetched the results, we can display them by iterating over the rows and printing each one, as the final working code below demonstrates.
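The raw tuples can also be turned into something more readable by using cursor.description, the DB-API sequence of column metadata tuples whose first element is the column name. The values below are stand-ins shaped like what I would expect PyHive to return (including the table-name prefix on column names), so the snippet runs without a server:

```python
# Stand-in for cursor.description: DB-API column metadata tuples,
# first element is the column name (PyHive typically prefixes the
# table name, e.g. "sample_08.code").
description = [("sample_08.code", "STRING_TYPE"),
               ("sample_08.description", "STRING_TYPE"),
               ("sample_08.total_emp", "INT_TYPE"),
               ("sample_08.salary", "INT_TYPE")]
# Stand-in for rows returned by cursor.fetchall()
rows = [("00-0000", "All Occupations", 135185230, 42270)]

columns = [d[0].split(".")[-1] for d in description]   # strip table prefix
records = [dict(zip(columns, row)) for row in rows]
for record in records:
    print(record)
```

Converting rows to dictionaries keyed by column name makes downstream code far easier to read than positional indexing.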
Final Working Code
# cat python_fetchall.py
from pyhive import hive

# establish connection to Hive server
connection = hive.connect(host='hostname', port=10000, username='systest')

# create cursor object
cursor = connection.cursor()

# execute SELECT statement to retrieve data from a table
cursor.execute("SELECT * FROM sample_08")

# fetch all rows at once using fetchall()
rows = cursor.fetchall()
for row in rows:
    print(row)

# close cursor and connection
cursor.close()
connection.close()
]# python python_fetchall.py
(u'00-0000', u'All Occupations', 135185230, 42270)
(u'11-0000', u'Management occupations', 6152650, 100310)
(u'11-1011', u'Chief executives', 301930, 160440)
(u'11-1021', u'General and operations managers', 1697690, 107970)
In conclusion, connecting Apache Hive with Python using PyHive is a great way to combine the strengths of both tools. PyHive exposes a Python DB-API interface to Hive, allowing you to execute SQL-like queries from Python and receive the results as Python objects, which makes downstream data analysis much easier.
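As one example of that downstream analysis, rows returned by fetchall() drop straight into a pandas DataFrame. This assumes pandas is installed; the rows here are the sample_08 values from the output above, so the snippet runs without a server:

```python
import pandas as pd

# Rows as cursor.fetchall() returned them from the sample_08 table
rows = [("00-0000", "All Occupations", 135185230, 42270),
        ("11-0000", "Management occupations", 6152650, 100310)]

df = pd.DataFrame(rows, columns=["code", "description", "total_emp", "salary"])
print(df["salary"].mean())
```

With a live connection you can also hand the connection object directly to pandas, e.g. pd.read_sql("SELECT * FROM sample_08", conn) — pandas officially targets SQLAlchemy connectables, but a DB-API connection like PyHive’s generally works.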
Overall, this can be a powerful combination for any data scientist or analyst who works with large datasets stored in Hadoop.
Good luck with your learning!