How to read and write XML files using Spark?

Spark is a powerful framework for processing large datasets in a distributed manner. In this article, we will discuss how to read and write XML files in Spark. We are using PySpark for all the examples discussed below.

To read and write XML files in Spark, use the “com.databricks.spark.xml” format. For reading, create a SparkSession and use “read” with the format and a rowTag option. For writing, call “write” on a DataFrame with the format, a rootTag, and a rowTag.

Read and Write XML files using Spark

Let’s discuss the following topics in detail

  • Create a Sample XML file
  • Reading XML files in PySpark
  • Writing XML files in PySpark
  • Verifying the data in XML files

Before going into these topics, let’s understand more about XML.

What is the XML file format?

XML, or eXtensible Markup Language, is a file format used to store and share data between different systems. It is a markup language, which means it uses tags to identify and organize data. These tags are similar to the HTML tags used to structure web pages, but XML is designed for data storage and exchange.

An XML file consists of a series of tags, each representing a different element of the stored data. For example, let’s say you want to store information about a group of people. We can create an XML file with a root element called “people” and then add a child element for each person, as below:

<people>
  <person>
    <name>Learn</name>
    <age>32</age>
    <gender>Male</gender>
  </person>
  <person>
    <name>Share</name>
    <age>27</age>
    <gender>Female</gender>
  </person>
</people>

In this example, the root element is “people”, and it contains two child elements, each representing a different person. Within each person element, there are child elements for the person’s name, age, and gender.

XML files are commonly used for data exchange between different software systems, as they provide a standardized format for organizing and storing data that can be easily interpreted by different applications.

Create a Sample XML file

To run the examples below, we need a sample XML file. Save the following content as “/tmp/book.xml” (the path used in the read examples):

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
</catalog>

Reading XML Files in PySpark

Reading XML files with PySpark involves a few simple steps:

NOTE: Running the example below without the spark-xml_2.11 jar on the classpath leads to the following error. Make sure to download the jar and add it when launching the PySpark application.

: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html

Step 1: Download the “spark-xml_2.11” jar

Download the latest “spark-xml_2.11” jar from Maven Central:

Example:

wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.10.0/spark-xml_2.11-0.10.0.jar
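
Alternatively, if the machine has internet access, you can skip the manual download and let Spark resolve the package from Maven Central with --packages (a sketch, using the same version as above):

pyspark --packages com.databricks:spark-xml_2.11:0.10.0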

Step 2: Create an interactive Spark session

We need to start a Spark session with the “spark-xml_2.11-0.10.0.jar” jar added to the classpath; this is what enables the XML processing support. I am using PySpark for all the examples.

pyspark --jars spark-xml_2.11-0.10.0.jar

Python 2.7.5 (default, Oct 14 2020, 14:45:30) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/04 15:09:57 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
Using Python version 2.7.5 (default, Oct 14 2020 14:45:30)
SparkSession available as 'spark'.
>>>

In the above example, a SparkSession is created by default. If not, we can create the session manually as below (after importing SparkSession, as shown in Step 3):

spark = SparkSession.builder.appName("XML processor").getOrCreate()
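
If you are running a standalone script rather than the interactive shell, a minimal sketch that also attaches the jar while building the session (the jar path is an assumption; adjust it to wherever you downloaded the jar):

from pyspark.sql import SparkSession

# Attach the spark-xml jar to the session (path assumed relative to the
# working directory).
spark = SparkSession.builder \
    .appName("XML processor") \
    .config("spark.jars", "spark-xml_2.11-0.10.0.jar") \
    .getOrCreate()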

Step 3: Import the necessary PySpark modules

from pyspark.sql import SparkSession

Step 4: Read the XML file using the read method of the SparkSession object

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "catalog").load("/tmp/book.xml")

In the above code, we specify the format as “com.databricks.spark.xml” and set rowTag to the element we want to treat as a row; here it is the root element, “catalog”. The load method reads the file from the specified path.
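
Note that rowTag does not have to be the root element; it selects whichever element should become a row. A sketch (assuming the same “/tmp/book.xml” path) that treats each <book> element as a row, which is often more convenient:

# One row per <book> element; the "id" attribute surfaces as an "_id" column.
books_df = spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("/tmp/book.xml")
books_df.select("_id", "title", "price").show()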

Step 5: Show the resulting data

df.collect()

Output: I am using the collect() action to show the rows (the output below is truncated).

>>> df.collect()

[Row(book=[Row(_id=u'bk101', author=u'Gambardella, Matthew', description=u'An in-depth look at creating applications \n      with XML.', genre=u'Computer', price=44.95, publish_date=u'2000-10-01', title=u"XML Developer's Guide"), Row(_id=u'bk102', author=u'Ralls, Kim', description=u'A former architect battles corporate zombies, \n      an evil sorceress, and her own childhood to become queen \n      of the world.', genre=u'Fantasy', price=5.95, publish_date=u'2000-12-16', title=u'Midnight Rain'), Row(_id=u'bk103',
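
With rowTag set to “catalog”, all books land in a single row as an array column. A sketch using explode() to flatten that array into one row per book:

from pyspark.sql.functions import explode

# Expand the "book" array into separate rows, then unpack the struct fields.
books = df.select(explode(df.book).alias("b")).select("b.*")
books.show()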

Writing XML Files in PySpark

Writing XML files with PySpark can be achieved with the following steps:

Step 1: Create a PySpark DataFrame with the data to be written

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", StringType()),
    StructField("gender", StringType())
])

data = [("Learn", "25", "Male"), ("Share", "30", "Female")]
df = spark.createDataFrame(data, schema)

In the above code, we create a schema that specifies the structure of the data and then build a DataFrame with that schema and data. If you already have a DataFrame, you can skip Step 1.

Step 2: Write the DataFrame to an XML file using the write method of the DataFrame

df.write.format("com.databricks.spark.xml").option("rootTag", "root").option("rowTag", "person").save("/tmp/person_")

Output:

]$ hadoop fs -cat /tmp/person_/part-00000
<root>
    <person>
        <name>Learn</name>
        <age>25</age>
        <gender>Male</gender>
    </person>

</root>

Here, we specify the format as “com.databricks.spark.xml”, the rootTag as the name of the root element of the XML file, and the rowTag as the name of the element for each row in the DataFrame. The save method is used to write the file to the specified path.
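
As a side note, spark-xml uses an attribute prefix (“_” by default) to tell attributes apart from child elements, and on write, columns whose names start with that prefix become XML attributes. A sketch with a hypothetical “/tmp/person_attr” output path:

# The "_id" column is written as an XML attribute: <person id="Learn">...
df_with_id = df.withColumn("_id", df.name)
df_with_id.write.format("com.databricks.spark.xml").option("rootTag", "root").option("rowTag", "person").save("/tmp/person_attr")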

NOTE: The above code creates a directory named “/tmp/person_”, and the XML output is split across multiple part files based on the number of partitions. If you need a single output file, repartition the DataFrame to one partition before writing (overwriting the previous output):

df.repartition(1).write.format("com.databricks.spark.xml").option("rootTag", "root").option("rowTag", "person").mode("overwrite").save("/tmp/person_")

Verifying the Data in XML Files

We can easily verify whether the data was written correctly by reading the file back into PySpark with the same read method. This gives us a DataFrame that we can use for further processing or transformation. Here, we just use show() to display the results.

>>> xml_data = spark.read.format("com.databricks.spark.xml").option("rowTag", "person").load("/tmp/person_")
>>> xml_data.show()  

Output:

+---+------+-----+
|age|gender| name|
+---+------+-----+
| 25|  Male|Learn|
| 30|Female|Share|
+---+------+-----+

>>> 

This will show the data from the XML file as a DataFrame in PySpark.
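
Beyond eyeballing the show() output, a quick programmatic sanity check is to compare the round-tripped row count with the original DataFrame:

# The round trip should preserve the number of rows.
assert xml_data.count() == df.count(), "row count mismatch after round trip"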

Conclusion

In conclusion, Spark, with the help of the spark-xml package, makes it straightforward to process XML files. Reading XML files involves specifying the format and the rowTag of the XML file, and writing XML files involves specifying the format, rootTag, and rowTag of the output file. With these simple steps, we can easily incorporate XML files into our Spark data processing pipelines.

If you have any follow-up questions or run into issues with the above examples, please feel free to start a discussion in the comment section.

Good luck with your learning!!
