How to Save DataFrame as a CSV File in Spark

Spark provides several APIs to save a DataFrame in multiple formats such as CSV, Parquet, Hive tables, etc. In this article, we will learn to save a DataFrame as a CSV file. Let's dive in.

Saving a DataFrame as a CSV file in Spark involves the df.write.csv("file.csv") method, with options like delimiter and encoding customizable through the option() method. Based on the number of partitions and the data size, Spark automatically generates multiple part files.


Prerequisites

Make sure you have a working Spark setup with Python (PySpark) installed before proceeding; the examples in this article were tested on a Cloudera Distribution Hadoop cluster.

Save DataFrame as a CSV File

In Spark, we can use the write method to save a DataFrame in our desired CSV format. The write method is provided by the DataFrameWriter class. I have laid out a step-by-step process to achieve it.

I am using PySpark (Python code) for this example. Hope it will be helpful.

NOTE: We have also discussed how to Save Spark DataFrame directly to Hive (Check it out)

Step 1: Create a DataFrame

We are going to create a DataFrame in this step; this DataFrame will be used in the following steps for saving as CSV files. For that, I have created a sample dataset and converted it to a DataFrame using the createDataFrame() method.

Before that, start a PySpark session, which by default will open a Spark session as below

# pyspark
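
If you are running the code as a standalone script instead of the pyspark shell, the spark session is not created automatically. A minimal sketch, assuming a standard PySpark installation (the app name is illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session when running outside the pyspark shell
spark = SparkSession.builder \
    .appName("save-dataframe-as-csv") \
    .getOrCreate()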

Once the Spark session is created, we can proceed with the DataFrame creation.

data = [
    {'Name': 'Iron Man', 'Age': 25, 'City': 'New York'},
    {'Name': 'Captain America', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Hulk', 'Age': 35, 'City': 'Chicago'}
]

# Create a DataFrame from the sample data
df = spark.createDataFrame(data)

# Show the DataFrame
df.show()

Output

>>> data = [
...     {'Name': 'Iron Man', 'Age': 25, 'City': 'New York'},
...     {'Name': 'Captain America', 'Age': 30, 'City': 'Los Angeles'},
...     {'Name': 'Hulk', 'Age': 35, 'City': 'Chicago'}
... ]
>>> df = spark.createDataFrame(data)
>>> df.show()
+---+-----------+---------------+                                               
|Age|       City|           Name|
+---+-----------+---------------+
| 25|   New York|       Iron Man|
| 30|Los Angeles|Captain America|
| 35|    Chicago|           Hulk|
+---+-----------+---------------+

Step 2: Save DataFrame as CSV

In this step, we will save the DataFrame to a CSV file using the write.csv() method. This method expects a path as input

df.write.csv("/tmp/file")

Replace "/tmp/file" with your actual path where you want to save the CSV files. By default, If you specify the path as /tmp/file it will save it in the HDFS location

Output:

# hadoop fs -ls /tmp/file

-rw-r--r--   3 test_user supergroup          0 2023-06-14 15:05 /tmp/file/_SUCCESS
-rw-r--r--   3 test_user supergroup         21 2023-06-14 15:05 /tmp/file/part-00000.csv

# Reading the file directly from HDFS path
# hadoop fs -cat /tmp/file/part-00000.csv

25,New York,Iron Man
30,Los Angeles,Captain America
35,Chicago,Hulk

Also, by default, Spark does not include a header row in the CSV file, which is why the output above contains only the data. If you want the column names included, enable the header option by setting it to true as below

df.write.option("header", "true").csv("/tmp/file")

Customizing CSV Writing Options

Spark provides various options to customize the CSV writing behavior. Below are some of the options

Delimiter

Using the delimiter option, we can choose the delimiter while saving the CSV file. By default, Spark uses a comma (,) as the delimiter; we can change this based on our needs. The example below stores the file with the pipe symbol (|) as the delimiter

df.write.option("delimiter", "|").csv("/tmp/file")

# Example
# hadoop fs -cat /tmp/file/part-00000.csv
25|New York|Iron Man
30|Los Angeles|Captain America
35|Chicago|Hulk

Encoding

You can specify the encoding of the output CSV file using the encoding option. By default, Spark uses UTF-8 encoding. The example below specifies UTF-16:

df.write.option("encoding", "UTF-16").csv("/tmp/file")

Quote and Escape Characters

You can specify custom characters for quoting and escaping in the CSV file using the quote and escape options, respectively. Below is an example

df.write.format("csv").option("quote", "'").option("escape", "\\").save("path/to/save")

Null Value

You can specify a custom string to represent null values in the CSV file using the nullValue option. By default, Spark uses an empty string. The example below sets "NA" for null values

df.write.format("csv").option("nullValue", "NA").save("path/to/save")

Other options

Spark provides many other options, such as dateFormat, timestampFormat, compression, and more. You can refer to the Spark documentation for a comprehensive list of available options and their usage.
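
As a quick, hedged sketch of the date and timestamp options (assuming your DataFrame has date or timestamp columns; the path is illustrative):

# Customize how date and timestamp columns are rendered in the CSV
df.write \
    .option("dateFormat", "yyyy/MM/dd") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .csv("/tmp/file_with_dates")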

Modes for Handling Existing Output Paths

DataFrameWriter in Spark also provides a mode() method, which allows you to specify the behavior in case the output path already exists.

Append Mode

df.write.format("csv").mode("append").save("path/to/save")

In this mode, if the provided output path already exists, the DataFrame will be appended to the existing data.

Overwrite Mode

df.write.format("csv").mode("overwrite").save("path/to/save")

In this mode, if the provided output path already exists, the existing data will be overwritten with the DataFrame.

Ignore Mode

df.write.format("csv").mode("ignore").save("path/to/save")

In this mode, if the provided output path already exists, the DataFrame will not be written, and no error will be thrown.

Error Mode (Default)

df.write.format("csv").mode("error").save("path/to/save")

In this mode, if the provided output path already exists, an error will be thrown, indicating that the write operation failed. (The alias "errorifexists" is also accepted.)

Handling Large DataFrames

If you write a DataFrame with multiple partitions and a large dataset, Spark by default writes the CSV output as multiple part files in the specified directory, based on the number of partitions and the size of the dataset. Each part file has a part number appended to its name, such as part-00000, part-00001, and so on.

Example:

# hadoop fs -ls /tmp/file

-rw-r--r--   3 test_user supergroup          0 2023-06-14 15:05 /tmp/file/_SUCCESS
-rw-r--r--   3 test_user supergroup         21 2023-06-14 15:05 /tmp/file/part-00000-c000.csv
-rw-r--r--   3 test_user supergroup         47 2023-06-14 15:05 /tmp/file/part-00001-c000.csv
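
If you need the output as a single file (advisable only for small results), a common approach is to reduce the DataFrame to a single partition before writing. A hedged sketch, with an illustrative path:

# Collapse to one partition so Spark writes a single part file
# (only do this when the data comfortably fits on one executor)
df.coalesce(1).write.option("header", "true").csv("/tmp/single_file")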

Conclusion

In summary, saving a DataFrame as a CSV file in Spark is done using the write.csv() method provided by the DataFrameWriter class. Along with the path, we can specify various options such as the delimiter and the save mode (for example, appending to existing data or overwriting it), all of which have been discussed in depth in this article.

As noted in the prerequisites, you need a Spark setup with Python installed; I have tested the above examples with Python on a Cloudera Distribution Hadoop cluster.

Good luck with your learning!!

Related Topics:

What is the difference between apache spark and pyspark

Difference between map and mapValues functions in Spark

How to read and write XML files using Spark?

How to Use Python Print to stderr

Difference between List and String in Python
