How to read and write Excel files with Spark

Apache Spark is a powerful data processing framework. It is commonly used to process data stored in a variety of formats, including Excel files. In this article, we will learn how to read Excel files into a DataFrame and write a DataFrame back to Excel.

To read and write Excel files in Spark, you can use either the Pandas library or the “spark-excel” connector. After installing the dependencies, use read_excel() and to_excel() for Pandas DataFrames, and use the DataFrame reader and writer with the com.crealytics.spark.excel format for Spark DataFrames. Options such as the sheet name and file location must be specified.

Prerequisites

Before we start, we assume that you have installed Spark and have a basic understanding of Spark programming in Python. In addition, we will be using the “spark-excel” and Pandas libraries to read and write Excel data.

Create an Excel sheet

I have created the Excel file below and uploaded it to the HDFS location “/tmp/Spark_excel.xlsx”.
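
If you want to reproduce the same file, here is a minimal sketch that builds the sample workbook with Pandas; the values match the output shown later in this article:

import pandas as pd

# Build the sample worksheet used throughout this article
sample = pd.DataFrame({"id": [1, 2, 3], "name": ["Tom", "Rich", "Bean"], "age": [23, 24, 32]})
sample.to_excel("Spark_excel.xlsx", sheet_name="Sheet1", index=False)

Then push it to the HDFS location used below:

hdfs dfs -put Spark_excel.xlsx /tmp/Spark_excel.xlsx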


Reading Excel Files

There are two easy ways to read Excel files in Python:

  1. Using spark-excel library
  2. Using Pandas

Using spark-excel library

In this example, we use the “spark-excel” library to read data from an Excel file. To use it, we need to add the package when launching the pyspark command; this pulls in “spark-excel” and all of its dependent libraries.

1. Start a Spark session and import the necessary libraries:

pyspark --packages com.crealytics:spark-excel_2.11:0.13.1
>>> from pyspark.sql import SparkSession

2. Read the Excel file into a Spark DataFrame:

>>> spark = SparkSession.builder.appName("Read Excel file with PySpark").getOrCreate()

>>> df = spark.read.format("com.crealytics.spark.excel") \
...     .option("sheetName", "Sheet1") \
...     .option("header", "true") \
...     .option("treatEmptyValuesAsNulls", "false") \
...     .option("inferSchema", "true") \
...     .load("/tmp/Spark_excel.xlsx")

3. Show the DataFrame to verify the data was read correctly:

>>> df.show() 
+---+----+----+                                                              
| id|name| age|
+---+----+----+
|1.0| Tom|23.0|
|2.0|Rich|24.0|
|3.0|Bean|32.0|
+---+----+----+                                                                  

In the above example, we create a SparkSession and use the spark.read.format() method to specify the format of the input file. We also specify the sheet name, whether to use the first row as headers, how to handle empty cells, and whether to infer the schema from the data.

Finally, we use the load() method to read the Excel file into a Spark DataFrame and the show() method to display its contents.
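
Note that with inferSchema enabled, the connector reads the numeric cells as doubles (hence the 1.0 and 23.0 in the output above). If you prefer integers, a quick cast inside the same session does the trick; a minimal sketch:

>>> from pyspark.sql.functions import col
>>> # Cast the double columns produced by schema inference back to integers
>>> df = df.withColumn("id", col("id").cast("int")).withColumn("age", col("age").cast("int"))
>>> df.show()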

Using Pandas

To use the Pandas library, we need to install Pandas and all of its dependent modules:

pip2 install pandas==0.23.0 numpy==1.16.6
pip2 install xlrd==0.9.0
pip2 install openpyxl

Once the above modules are installed, we can go ahead and start a pyspark session:

pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8.7.2.17.0-230
      /_/

Using Python version 2.7.5 (default, Jun 28 2022 15:30:04)
SparkSession available as 'spark'.

1. Import the necessary libraries:

>>> import pandas as pd

2. Read the Excel file into a Pandas DataFrame:

>>> df = pd.read_excel("/tmp/Spark_excel.xlsx", sheet_name="Sheet1", header=0, dtype={"id": int, "name": str, "age": int})

3. Show the DataFrame to verify the data was read correctly:

>>> print(df.head())
   id  name  age
0   1   Tom   23
1   2  Rich   24
2   3  Bean   32
>>> 

In the above example, we use the pd.read_excel() function to read the Excel file into a Pandas DataFrame. We specify the file path, the sheet name, the header row index, and the data types of the columns.

Finally, we use the head() method to display the first few rows of the DataFrame.
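
Since this runs inside a pyspark session, you can also hand the Pandas DataFrame over to Spark when you need distributed processing; a minimal sketch using the “spark” session that pyspark already provides:

>>> # Convert the Pandas DataFrame into a Spark DataFrame
>>> spark_df = spark.createDataFrame(df)
>>> spark_df.show()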

Writing Excel Files

Again, there are two easy ways to write Excel files:

  1. Using spark-excel library
  2. Using Pandas

Using spark-excel library

The initialization is similar to reading data from Excel: we need to add the “spark-excel” package when launching the pyspark command, which pulls in “spark-excel” and all of its dependent libraries.

1. Start a Spark session and import the necessary libraries:

pyspark --packages com.crealytics:spark-excel_2.11:0.13.1
>>> from pyspark.sql import SparkSession

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType

2. Create a DataFrame:

>>> spark = SparkSession.builder.appName("Write Excel file with PySpark").getOrCreate()

>>> data = [(1, "John", 30), (2, "Jane", 25), (3, "Bob", 40)]

>>> schema = StructType([StructField("id", IntegerType(), True), StructField("name", StringType(), True), StructField("age", IntegerType(), True)])

>>> df = spark.createDataFrame(data, schema)

3. Write the Spark DataFrame to an Excel file:

>>> df.write.format("com.crealytics.spark.excel") \
...     .option("sheetName", "Sheet1") \
...     .option("header", "true") \
...     .option("dateFormat", "yyyy-MM-dd HH:mm:ss") \
...     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
...     .mode("overwrite") \
...     .save("/tmp/spark_excel.xlsx")


In the above example, we specify the sheet name, whether to use the column names as headers, and the date and timestamp formats. We also set the mode to “overwrite” so that any existing file with the same name is replaced.
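
To confirm the write succeeded, you can read the file straight back with the same connector; a quick sanity check:

>>> check = spark.read.format("com.crealytics.spark.excel") \
...     .option("sheetName", "Sheet1") \
...     .option("header", "true") \
...     .option("inferSchema", "true") \
...     .load("/tmp/spark_excel.xlsx")
>>> check.show()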

Using Pandas

Similarly, we can use the Pandas library to write to Excel files. Make sure to install the dependent modules before proceeding further:

pip2 install pandas==0.23.0 numpy==1.16.6 
pip2 install xlrd==0.9.0 
pip2 install openpyxl

Write the Pandas DataFrame to an Excel file:

import pandas as pd 

data = {"id": [1, 2, 3], "name": ["John", "Jane", "Bob"], "age": [30, 25, 40]} df = pd.DataFrame(data) df.to_excel("/tmp/new_excel.xlsx", sheet_name="Sheet1", index=False)

In the above example, we create a Pandas DataFrame from a dictionary and then use the df.to_excel() method to write it to an Excel file. We specify the output file path, the sheet name, and whether to include the row index.
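
If you need more than one sheet in the same workbook, Pandas provides pd.ExcelWriter; below is a minimal sketch, where the second sheet name “Backup” is just an illustration:

# Write the same DataFrame to two sheets of one workbook
with pd.ExcelWriter("/tmp/new_excel.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    df.to_excel(writer, sheet_name="Backup", index=False)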

Conclusion

In this article, we demonstrated how to read and write Excel files using PySpark. By using the “spark-excel” and Pandas libraries and leveraging Spark’s distributed and scalable processing capabilities, we can easily handle large Excel files in a Python-based Spark application.

All the examples are written in a plug-and-play fashion, so you can run them by copying and pasting them into a pyspark session :-). Start a discussion if you have any questions.

Good luck with your learning!!
