How to read and write XML files using Spark?
Spark is a powerful framework for processing large datasets in a distributed manner. In this article, we will discuss how to read and write XML files in Spark. We use PySpark for all the examples discussed below.
To read and write XML files in Spark, use the “com.databricks.spark.xml” format. For reading, create a SparkSession and use “read” with format and rowTag. For writing, create a DataFrame and use “write” with format, rootTag, and rowTag.

Let’s discuss the following topics in detail:
- Create a Sample XML file
- Reading XML files in PySpark
- Writing XML files in PySpark
- Verifying the data in XML files
Before going into the topics, let’s understand more about XML.
What is the XML file format?
XML, or eXtensible Markup Language, is a file format used to store and share data between different systems. It is a markup language, which means it uses tags to identify and organize data. These tags are similar to the HTML tags used to structure web pages, but XML is focused on data storage.
An XML file contains a series of tags, each representing a different element of the data being stored. For example, let’s say we want to store information about a group of people. We can create an XML file with a root element called “people” and then add a child element for each person, as below:
<people>
  <person>
    <name>Learn</name>
    <age>32</age>
    <gender>Male</gender>
  </person>
  <person>
    <name>Share</name>
    <age>27</age>
    <gender>Female</gender>
  </person>
</people>
In this example, the root element is “people”, and it contains two child elements, each representing a different person. Within each person element, there are child elements for the person’s name, age, and gender.
XML files are commonly used for data exchange between different software systems, as they provide a standardized format for organizing and storing data that can be easily interpreted by different applications.
Create a Sample XML file
To run the examples below, we need a sample XML file. Use this link to download sample XML files for all of the given examples:
<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Corets, Eva</author> <title>Oberon's Legacy</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-03-10</publish_date> <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description> </book> <book id="bk105"> <author>Corets, Eva</author> <title>The Sundered Grail</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-09-10</publish_date> <description>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</description> </book> <book id="bk106"> <author>Randall, Cynthia</author> <title>Lover Birds</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-09-02</publish_date> <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description> </book> <book id="bk107"> <author>Thurman, Paula</author> <title>Splish Splash</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-11-02</publish_date> <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description> </book> </catalog>
Reading XML Files in PySpark
Reading XML files with PySpark involves a few simple steps:
NOTE: Running the example below without including the spark-xml_2.11 jar will lead to the following error message. Make sure to follow the steps to download the jar and add it when launching the PySpark application.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
Step1: Download the “spark-xml_2.11” jar
Download the latest “spark-xml_2.11” jar from the Maven repository.
Example:
wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.10.0/spark-xml_2.11-0.10.0.jar
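Alternatively, instead of downloading the jar manually, you can let Spark resolve the package from Maven when the shell starts, using the --packages option. The version shown here mirrors the wget example above and is only illustrative; pick the version that matches your Spark/Scala build.
# Let Spark fetch spark-xml and its dependencies from Maven at launch time
pyspark --packages com.databricks:spark-xml_2.11:0.10.0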
Step2: Create a Spark interactive session
We need to create a Spark session with the “spark-xml_2.11-0.10.0.jar” jar added to the classpath; this is what gives the session its XML processing capability. PySpark is used for all the examples.
pyspark --jars spark-xml_2.11-0.10.0.jar
Python 2.7.5 (default, Oct 14 2020, 14:45:30)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/04 15:09:57 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.5 (default, Oct 14 2020 14:45:30)
SparkSession available as 'spark'.
>>>
In the above example, a SparkSession is created by default. If not, we need to create the session manually as below:
spark = SparkSession.builder.appName("XML processor").getOrCreate()
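If you are building the session yourself (for example in a standalone script rather than the interactive shell), the package can also be attached through the session configuration. This is a minimal sketch; the Maven coordinate mirrors the jar version used above:
from pyspark.sql import SparkSession

# Build a session and pull in the spark-xml package via config,
# equivalent to passing --jars/--packages on the command line
spark = (SparkSession.builder
         .appName("XML processor")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.10.0")
         .getOrCreate())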
Step3: Import the necessary PySpark modules
from pyspark.sql import SparkSession
Step4: Read the XML file using the read method of the SparkSession object
df = spark.read.format("com.databricks.spark.xml").option("rowTag", "catalog").load("/tmp/book.xml")
In the above code, we specify the format as “com.databricks.spark.xml” and set rowTag to “catalog”. The rowTag option tells spark-xml which XML element to treat as a row; since it points at the root element here, the entire catalog is read back as a single row containing an array of book records. The load method reads the file from the specified path.
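As a variation on the code above, pointing rowTag at the “book” element instead gives one DataFrame row per book, which is often more convenient for further processing. A small illustrative sketch using the same file path (note that spark-xml prefixes XML attributes such as id with an underscore by default):
# One row per <book> element instead of one row for the whole catalog
df_books = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "book") \
    .load("/tmp/book.xml")

df_books.printSchema()
df_books.select("_id", "title", "price").show(truncate=False)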
Step5: Show the resulting data
df.collect()
Output: Here, the collect() action is used to show the rows
>>> df.collect() [Row(book=[Row(_id=u'bk101', author=u'Gambardella, Matthew', description=u'An in-depth look at creating applications \n with XML.', genre=u'Computer', price=44.95, publish_date=u'2000-10-01', title=u"XML Developer's Guide"), Row(_id=u'bk102', author=u'Ralls, Kim', description=u'A former architect battles corporate zombies, \n an evil sorceress, and her own childhood to become queen \n of the world.', genre=u'Fantasy', price=5.95, publish_date=u'2000-12-16', title=u'Midnight Rain'), Row(_id=u'bk103',
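Because rowTag was set to the root element “catalog”, the whole file comes back as a single row whose book column is an array of structs. If you want one row per book from this DataFrame, one option is to explode that array, sketched below based on the schema shown in the output above:
from pyspark.sql.functions import explode

# Turn the single catalog row into one row per book, then flatten the struct fields
books = df.select(explode("book").alias("book")).select("book.*")
books.select("_id", "title", "genre", "price").show(truncate=False)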
Writing XML Files in PySpark
Writing XML files with PySpark can be achieved with the steps below.
Step1: Create a PySpark DataFrame with the data to be written
from pyspark.sql.types import StructType, StructField, StringType

# Schema describing the structure of each person record
schema = StructType([
    StructField("name", StringType()),
    StructField("age", StringType()),
    StructField("gender", StringType())
])

# Sample rows matching the schema above
data = [("Learn", "25", "Male"), ("Share", "30", "Female")]
df = spark.createDataFrame(data, schema)
In the above code, we create a schema that specifies the structure of the data and then create a DataFrame with that schema and data. If you already have a DataFrame, you can skip Step 1.
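Before writing, it can be worth a quick sanity check that the DataFrame looks as expected:
# Preview the schema and rows that will be written out
df.printSchema()
df.show()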
Step 2: Write the DataFrame to an XML file using the write method of the DataFrame
df.write.format("com.databricks.spark.xml").option("rootTag", "root").option("rowTag", "person").save("/tmp/person_")
Output:
$ hadoop fs -cat /tmp/person_/part-00000
<root>
<person>
<name>Learn</name>
<age>25</age>
<gender>Male</gender>
</person>
</root>
Here, we specify the format as “com.databricks.spark.xml”, the rootTag as the name of the root element of the XML file, and the rowTag as the name of the element for each row in the DataFrame. The save method is used to write the file to the specified path.
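One thing to keep in mind: by default, save fails if the target directory already exists. Like most Spark data sources, spark-xml generally honors the standard save modes, so re-running the write can be handled with mode("overwrite"). A hedged sketch reusing the same options and path:
# Overwrite the existing output directory instead of failing on re-runs
df.write.format("com.databricks.spark.xml") \
    .option("rootTag", "root") \
    .option("rowTag", "person") \
    .mode("overwrite") \
    .save("/tmp/person_")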
NOTE: The above code creates a directory named “/tmp/person_”, and the XML output is split into multiple part files based on the number of partitions and executors. If you need a single output file, repartition the DataFrame to one partition before writing:
df.repartition(1)
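For example, a minimal sketch chaining the repartition into the write (the output path /tmp/person_single is just an illustrative name, chosen to avoid clashing with the directory written above):
# Collapse the DataFrame to a single partition so only one part file is produced
df.repartition(1) \
    .write.format("com.databricks.spark.xml") \
    .option("rootTag", "root") \
    .option("rowTag", "person") \
    .save("/tmp/person_single")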
Verifying the Data in XML Files
We can easily verify whether the data was written correctly to the XML file by reading the file back into PySpark with the same read method. This gives us a DataFrame that we can use for further processing and transformations. Here, we just use show() to display the results.
>>> xml_data = spark.read.format("com.databricks.spark.xml").option("rowTag", "person").load("/tmp/person_")
>>> xml_data.show()
Output:
+---+------+----+
|age|gender|name|
+---+------+----+
| 25| Male|Learn|
| 30|Female|Share|
+---+------+----+
>>>
This will show the data from the XML file as a DataFrame in PySpark.
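If you want something a bit stronger than an eyeball check, you can also compare the round-tripped data against the original DataFrame, for example by comparing row counts and values. A simple illustrative check, assuming the df from the writing step is still in scope:
# The number of rows read back should match what was written
assert xml_data.count() == df.count()

# Compare the actual values, ignoring row and column order
original = {tuple(r) for r in df.select("name", "age", "gender").collect()}
round_trip = {tuple(r) for r in xml_data.select("name", "age", "gender").collect()}
assert original == round_trip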
Conclusion
In conclusion, Spark, together with the spark-xml package, provides convenient support for processing XML files. Reading XML files involves specifying the format and rowTag of the XML file, and writing XML files involves specifying the format, rootTag, and rowTag of the output file. With these simple steps, we can easily incorporate XML files into our Spark data processing pipelines.
If you have any follow-up questions or are facing issues with the above examples, please feel free to start a discussion in the comment section.
Good Luck with your Learning !!