How to Save DataFrame as a CSV File in Spark
Spark provides several APIs to save a DataFrame in formats such as CSV, Parquet, Hive tables, and more. In this article, we will learn how to save a DataFrame as a CSV file. Let’s dive in.
Saving a DataFrame as a CSV file in Spark involves the df.write.csv("file.csv") method, with options like the delimiter and encoding customizable through the option() method. Based on the number of partitions and the data size, Spark automatically generates multiple part files.

Prerequisites
Make sure the following prerequisites are in place:
- Spark installed and set up on your system
- A basic understanding of Spark DataFrames
Save DataFrame as a CSV File
In Spark, we can use the write method provided by the DataFrameWriter class to save a DataFrame in our desired CSV format. I have laid out a step-by-step process to achieve it.
I am using PySpark (Python code) for this example. Hope it will be helpful.
NOTE: We have also discussed how to save a Spark DataFrame directly to Hive (check it out).
Step 1: Create a DataFrame
We are going to create a DataFrame in this step; it will be used in the following examples on saving a DataFrame as CSV files. For that, I have created a sample dataset and converted it to a DataFrame using the createDataFrame() method.
Before that, start the PySpark shell, which by default will create a Spark session, as below:
# pyspark
Once the Spark session is created, we can proceed with the DataFrame creation:
data = [
    {'Name': 'Iron Man', 'Age': 25, 'City': 'New York'},
    {'Name': 'Captain America', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Hulk', 'Age': 35, 'City': 'Chicago'}
]

# Create a DataFrame from the sample data
df = spark.createDataFrame(data)

# Show the DataFrame
df.show()
Output
>>> data = [
...     {'Name': 'Iron Man', 'Age': 25, 'City': 'New York'},
...     {'Name': 'Captain America', 'Age': 30, 'City': 'Los Angeles'},
...     {'Name': 'Hulk', 'Age': 35, 'City': 'Chicago'}
... ]
>>> df = spark.createDataFrame(data)
>>> df.show()
+---+-----------+---------------+
|Age|       City|           Name|
+---+-----------+---------------+
| 25|   New York|       Iron Man|
| 30|Los Angeles|Captain America|
| 35|    Chicago|           Hulk|
+---+-----------+---------------+
Step 2: Save DataFrame as CSV
In this step, we will save the DataFrame to a CSV file using the write.csv() method. This method expects a path as input:
df.write.csv("/tmp/file")
Replace "/tmp/file" with the actual path where you want to save the CSV files. If you specify the path as /tmp/file, by default Spark will save it to the HDFS location /tmp/file.
Output:
# hadoop fs -ls /tmp/file
-rw-r--r--   3 test_user supergroup          0 2023-06-14 15:05 /tmp/file/_SUCCESS
-rw-r--r--   3 test_user supergroup         21 2023-06-14 15:05 /tmp/file/part-00000.csv

# Reading the file directly from the HDFS path
# hadoop fs -cat /tmp/file/part-00000.csv
25,New York,Iron Man
30,Los Angeles,Captain America
35,Chicago,Hulk
Also, note that by default Spark does not include a header row in the CSV file. If you want the column names written as the first row, enable it by setting the header option to "true", as below:
df.write.option("header", "true").csv("/tmp/file")
Customizing CSV Writing Options
Spark provides various options to customize the CSV writing behavior. Below are some of the options
Delimiter
Using the delimiter option, we can choose the delimiter while saving the CSV file. By default, Spark uses a comma (,) as the delimiter. We can change this based on our needs; the below example stores the file with the pipe symbol (|) as the delimiter:
df.write.option("delimiter", "|").csv("/tmp/file")
#Example
# hadoop fs -cat /tmp/file/part-00000.csv
25|New York|Iron Man
30|Los Angeles|Captain America
35|Chicago|Hulk
Encoding
You can specify the encoding of the output CSV file using the encoding option. By default, Spark uses UTF-8 encoding. The below example specifies the encoding as UTF-16:
df.write.option("encoding", "UTF-16").csv("/tmp/file")
Quote and Escape Characters
You can specify custom characters for quoting and escaping in the CSV file using the quote and escape options, respectively. Below is an example:
df.write.format("csv").option("quote", "'").option("escape", "\\").save("path/to/save")
Null Value
You can specify a custom string to represent null values in the CSV file using the nullValue option. By default, Spark uses an empty string. The below example sets "NA" for null values:
df.write.format("csv").option("nullValue", "NA").save("path/to/save")
Other options
Spark provides many other options, such as encoding, dateFormat, timestampFormat, and more. You can refer to the Spark documentation for a comprehensive list of available options and their usage.
Modes for Handling Existing Output Paths
The DataFrameWriter class in Spark also provides a mode() method, which allows you to specify the behavior when the output path already exists.
Append Mode
df.write.format("csv").mode("append").save("path/to/save")
In this mode, if the provided output path already exists, the DataFrame will be appended to the existing data.
Overwrite Mode
df.write.format("csv").mode("overwrite").save("path/to/save")
In this mode, if the provided output path already exists, the existing data will be overwritten with the DataFrame.
Ignore Mode
df.write.format("csv").mode("ignore").save("path/to/save")
In this Mode, if the provided output path already exists, the DataFrame will not be written, and no error will be thrown.
Error Mode (Default)
df.write.format("csv").mode("error").save("path/to/save")
In this mode, if the provided output path already exists, an error will be thrown, indicating that the write operation failed.
Handling Large DataFrames
Let’s say you are writing a DataFrame with multiple partitions and a large dataset. By default, Spark will write the CSV output as multiple part files in the specified directory, based on the number of partitions and the size of the dataset. Each part file has a part number appended to its name, such as part-00000, part-00001, and so on.
Example:
# hadoop fs -ls /tmp/file
-rw-r--r--   3 test_user supergroup          0 2023-06-14 15:05 /tmp/file/_SUCCESS
-rw-r--r--   3 test_user supergroup         21 2023-06-14 15:05 /tmp/file/part-00000-c000.csv
-rw-r--r--   3 test_user supergroup         47 2023-06-14 15:05 /tmp/file/part-00001-c000.csv
Conclusion
In summary, saving a DataFrame as a CSV file in Spark can be done using the write.csv() method provided by the DataFrameWriter class. Along with the path, we can specify various options like the delimiter and the save mode (for example, appending to existing data or overwriting it), which have been discussed in depth in this article.
As a prerequisite, we need a Spark setup with Python installed; I have tested the above examples with Python on a Cloudera Distribution Hadoop cluster.
Good Luck with your Learning !!
Related Topics:
What is the difference between apache spark and pyspark
Difference between map and mapValues functions in Spark
How to read and write XML files using Spark?