How to read and write Excel files with Spark
Apache Spark is a powerful data processing framework. It is commonly used to process data stored in various formats, including Excel files. In this article, we will learn how to read Excel files into a DataFrame and write a DataFrame back to Excel.
To read and write Excel files in Spark, install the required dependencies, then use Pandas' read_excel() and to_excel() functions for Pandas DataFrames, or the spark-excel library's com.crealytics.spark.excel data source with spark.read and df.write for Spark DataFrames. Options such as the sheet name and file location must be specified.
Prerequisites
Before we start, make sure you have installed Spark and have a basic understanding of Spark programming in Python. In addition, we will be using the "spark-excel" and Pandas libraries to read and write Excel data.
Create an Excel sheet
I have created the Excel file below and uploaded it to the HDFS location "/tmp/Spark_excel.xlsx". It contains a single sheet, "Sheet1", with the following data:
id  name  age
1   Tom   23
2   Rich  24
3   Bean  32
Reading Excel Files
There are two easy ways to read Excel files in PySpark:
- Using the spark-excel library
- Using Pandas
Using the spark-excel library
In this example, we use the "spark-excel" library to read data from an Excel file. To use it, we need to add the package when launching the PySpark shell; this pulls in "spark-excel" and all of its dependent libraries.
1. Start a Spark session and import the necessary libraries:
pyspark --packages com.crealytics:spark-excel_2.11:0.13.1
>>> from pyspark.sql import SparkSession
2. Read the Excel file into a Spark DataFrame:
>>> spark = SparkSession.builder.appName("Read Excel file with PySpark").getOrCreate()
>>> df = spark.read.format("com.crealytics.spark.excel") \
...     .option("sheetName", "Sheet1") \
...     .option("header", "true") \
...     .option("treatEmptyValuesAsNulls", "false") \
...     .option("inferSchema", "true") \
...     .load("/tmp/Spark_excel.xlsx")
3. Show the DataFrame to verify the data was read correctly:
>>> df.show()
+---+----+----+
| id|name| age|
+---+----+----+
|1.0| Tom|23.0|
|2.0|Rich|24.0|
|3.0|Bean|32.0|
+---+----+----+
In the above example, we created a SparkSession and used the spark.read.format() method to specify the format of the input file. We also specified the sheet name, whether to use the first row as headers, how to handle empty cells, and whether to infer the schema from the data. Finally, we used the load() method to read the Excel file into a Spark DataFrame and the show() method to display its contents.
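One thing to note: with inferSchema enabled, the numeric columns were read as doubles (1.0, 23.0, and so on). If you want integer types instead, one simple option is to cast the columns after loading; a minimal sketch:
>>> from pyspark.sql.functions import col
>>> df = df.withColumn("id", col("id").cast("int")) \
...        .withColumn("age", col("age").cast("int"))
>>> df.printSchema()  # id and age are now integers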
Using Pandas
To use the Pandas library, we need to install Pandas and all of its dependent modules:
pip2 install pandas==0.23.0 numpy==1.16.6
pip2 install xlrd==0.9.0
pip2 install openpyxl
Once the above modules are installed, we can go ahead and start the PySpark session:
pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8.7.2.17.0-230
      /_/

Using Python version 2.7.5 (default, Jun 28 2022 15:30:04)
SparkSession available as 'spark'.
1. Import the necessary libraries:
>>> import pandas as pd
2. Read the Excel file into a Pandas DataFrame:
>>> df = pd.read_excel("/tmp/Spark_excel.xlsx", sheet_name="Sheet1", header=0, dtype={"id": int, "name": str, "age": int})
3. Show the DataFrame to verify the data was read correctly:
>>> print(df.head())
   id  name  age
0   1   Tom   23
1   2  Rich   24
2   3  Bean   32
>>>
In the above example, we use the pd.read_excel() function to read the Excel file into a Pandas DataFrame. We specify the file path, the sheet name, the header row index, and the data types of the columns. Finally, we use the head() method to display the first few rows of the DataFrame.
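Keep in mind that pd.read_excel() returns a plain Pandas DataFrame that lives in the driver's memory, not a Spark DataFrame. If you want to continue with Spark's distributed processing, you can convert it inside the same PySpark session; a minimal sketch:
>>> spark_df = spark.createDataFrame(df)  # convert the Pandas DataFrame to a Spark DataFrame
>>> spark_df.show()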
Writing Excel Files
Again, there are two easy ways to write Excel files:
- Using the spark-excel library
- Using Pandas
Using the spark-excel library
The initialization is similar to reading data from Excel: we need to add the "spark-excel" package when launching the PySpark shell, which pulls in "spark-excel" and all of its dependent libraries.
1. Start a Spark session and import the necessary libraries:
pyspark --packages com.crealytics:spark-excel_2.11:0.13.1
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
2. Create a DataFrame:
>>> spark = SparkSession.builder.appName("Write Excel file with PySpark").getOrCreate()
>>> data = [(1, "John", 30), (2, "Jane", 25), (3, "Bob", 40)]
>>> schema = StructType([
...     StructField("id", IntegerType(), True),
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)
... ])
>>> df = spark.createDataFrame(data, schema)
3. Write the Spark DataFrame to an Excel file:
>>> df.write.format("com.crealytics.spark.excel") \
...     .option("sheetName", "Sheet1") \
...     .option("header", "true") \
...     .option("dateFormat", "yyyy-MM-dd HH:mm:ss") \
...     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
...     .mode("overwrite") \
...     .save("/tmp/spark_excel.xlsx")

In the above example, we specify the sheet name, whether or not to use the column names as headers, and the date and timestamp formats. We also specify the mode as “overwrite” so that any existing file with the same name will be replaced.
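To verify that the write succeeded, you can read the file back with the same data source; a quick sanity check reusing the read options from the earlier section:
>>> check = spark.read.format("com.crealytics.spark.excel") \
...     .option("sheetName", "Sheet1") \
...     .option("header", "true") \
...     .option("inferSchema", "true") \
...     .load("/tmp/spark_excel.xlsx")
>>> check.show()  # should print the three rows we just wrote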
Using Pandas
Similarly, we can use the Pandas library to write Excel files. Make sure to install the dependent modules before proceeding:
pip2 install pandas==0.23.0 numpy==1.16.6
pip2 install xlrd==0.9.0
pip2 install openpyxl
Write the Pandas DataFrame to an Excel file:
import pandas as pd

data = {"id": [1, 2, 3], "name": ["John", "Jane", "Bob"], "age": [30, 25, 40]}
df = pd.DataFrame(data)
df.to_excel("/tmp/new_excel.xlsx", sheet_name="Sheet1", index=False)
In the above example, we created a Pandas DataFrame from a dictionary and then used the df.to_excel() method to write the DataFrame to an Excel file. We specify the output file path, the sheet name, and whether or not to include the row index.
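If you need more than one sheet in the same workbook, Pandas provides pd.ExcelWriter for that; a minimal sketch that writes the same df from the example above to two sheets, just for illustration:
import pandas as pd

# df is the Pandas DataFrame created in the example above
with pd.ExcelWriter("/tmp/new_excel.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    df.to_excel(writer, sheet_name="Sheet2", index=False)  # same data on a second sheet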
Conclusion
In this article, we demonstrated how to read and write Excel files using PySpark. By using the "spark-excel" and Pandas libraries and leveraging Spark's distributed, scalable processing capabilities, we can easily handle large Excel files in a Python-based Spark application.
All of the examples are plug-and-play: you can run them by copying and pasting them into a PySpark session :-). Start a discussion if you have any questions.