How to run Spark job with Ozone Filesystem

Apache Spark is a popular distributed computing framework for big-data processing, and Apache Ozone is a scalable, distributed object store for the Hadoop ecosystem that exposes both an S3-compatible API and Hadoop-compatible filesystem interfaces. Through the Hadoop-compatible interface, Spark can read and write data directly on the Ozone filesystem.

Check out the article I have written comparing the performance of the Ozone and HDFS filesystems and their use cases.

Prerequisites

Before we begin, ensure you have the following:

  • Apache Hadoop Ozone installed and configured.
  • Spark libraries installed and configured.

In this article, we will walk through a step-by-step procedure to run a Spark job against the Ozone filesystem (on a Cloudera distribution), where all the data files live in Ozone. Before running the Spark job, we will see how to upload data to the Ozone filesystem.

Set up Ozone filesystem

Let’s create a volume and a bucket in the Ozone filesystem; these are Ozone’s storage elements.

1. Create a volume 

$ ozone sh volume create /volume
23/03/02 09:33:40 INFO rpc.RpcClient: Creating Volume: volume, with test_user as owner and space quota set to -1 bytes, counts quota set to -1

2. Create a bucket

$ ozone sh bucket create /volume/bucket
23/03/02 09:34:17 INFO rpc.RpcClient: Creating Bucket: volume/bucket, with the Bucket Layout null, test_user as owner, Versioning false, Storage Type set to DISK and Encryption set to false 

3. Add the property fs.o3fs.impl to core-site.xml.

<property>
  <name>fs.o3fs.impl</name>
  <value>org.apache.hadoop.fs.ozone.OzoneFileSystem</value>
</property>

If you are using the Cloudera distribution, you can add this property under the HDFS advanced configuration snippet for core-site.xml.
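
If editing core-site.xml is not convenient, the same Hadoop property can also be passed to Spark directly through the spark.hadoop. prefix, either on the command line (--conf spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem) or in the SparkSession builder. A minimal sketch, assuming the Ozone filesystem client jar is already on the Spark classpath (the application name is illustrative):

from pyspark.sql import SparkSession

# Forward the o3fs implementation class to the Hadoop configuration
# via the spark.hadoop.* prefix instead of editing core-site.xml.
spark = (SparkSession.builder
         .appName("ozone-wordcount")
         .config("spark.hadoop.fs.o3fs.impl",
                 "org.apache.hadoop.fs.ozone.OzoneFileSystem")
         .getOrCreate())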

4. Run a Wordcount job with the Ozone filesystem using PySpark

– Create a text file with words

vi /tmp/wordcount.txt
“Learn Share is a platform to share knowledge in which others can able to learn from your mistake and experience”

– Upload file to the Ozone filesystem

$ ozone sh key put /volume/bucket/wordcount.txt /tmp/wordcount.txt
23/03/02 10:46:59 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-xceiverclientmetrics.properties,hadoop-metrics2.properties
23/03/02 10:46:59 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
23/03/02 10:46:59 INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
23/03/02 10:47:00 INFO metrics.MetricRegistries: Loaded MetricRegistries class org.apache.ratis.metrics.impl.MetricRegistriesImpl

– Access the file from the Ozone filesystem using the hdfs CLI command

hdfs dfs -ls o3fs://bucket.volume.<service id>/

Edit the bucket name, volume name, and Ozone service ID based on your cluster.

NOTE: Check your Ozone configuration for the service ID (“ozone.service.id”). In my cluster, it is ozone.

hdfs dfs -ls o3fs://bucket.volume.ozone/
Found 2 items
-rw-rw-rw-   3 learn learn         46 2023-03-02 09:36 o3fs://bucket.volume.ozone/data.csv
-rw-rw-rw-   3 learn learn        112 2023-03-02 10:47 o3fs://bucket.volume.ozone/wordcount.txt
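
If you prefer to verify the same path from Python instead of the hdfs CLI, below is a minimal sketch that goes through Spark's JVM gateway to the Hadoop FileSystem API. It assumes a SparkSession named spark is already available (for example, the one created by the pyspark shell in the next step) and reuses the bucket, volume, and service ID from the example above:

# List the contents of the o3fs bucket through the Hadoop FileSystem
# Java API, reached via PySpark's (internal) JVM gateway.
conf = spark._jsc.hadoopConfiguration()
path = spark._jvm.org.apache.hadoop.fs.Path("o3fs://bucket.volume.ozone/")
fs = path.getFileSystem(conf)
for status in fs.listStatus(path):
    print("%s %d" % (status.getPath(), status.getLen()))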

– Create a pyspark session

$ pyspark
Python 2.7.5 (default, Oct 14 2020, 14:45:30) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/02 10:28:54 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/
Using Python version 2.7.5 (default, Oct 14 2020 14:45:30)
SparkSession available as 'spark'.
>>>

– Run the Wordcount program by pointing the location to the Ozone filesystem

>>> data=sc.textFile("o3fs://bucket.volume.ozone/wordcount.txt")
>>> wordcount=data.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
>>> output=wordcount.collect()
>>> for (word, count) in output:                                                
    print("%s: %i" % (word, count))

Output:

a: 1
and: 1
from: 1
able: 1
is: 1
share: 1
experience: 1
Share: 1
which: 1
Learn: 1
knowledge: 1
platform: 1
learn: 1
to: 2
can: 1
in: 1
your: 1
others: 1
mistake: 1

>>> 

In the above example, we have a text file in the Ozone filesystem; “data” is an RDD, and we apply “flatMap”, “map”, and “reduceByKey” as transformations.

Finally, we trigger an action with “collect” to gather the final result and print it in a human-readable format.
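
For reference, the same word count can also be packaged as a standalone script and launched with spark-submit instead of the interactive shell. A minimal sketch, reusing the o3fs path from the example above (the application name is illustrative):

from operator import add
from pyspark.sql import SparkSession

# Standalone version of the word count above, suitable for spark-submit.
spark = SparkSession.builder.appName("ozone-wordcount").getOrCreate()
sc = spark.sparkContext

data = sc.textFile("o3fs://bucket.volume.ozone/wordcount.txt")
wordcount = (data.flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .reduceByKey(add))

for word, count in wordcount.collect():
    print("%s: %i" % (word, count))

spark.stop()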

– Write to the Ozone filesystem using PySpark

Building on the above example, instead of printing the result to the console, we can write the output back to the Ozone filesystem as shown below.

>>> data=sc.textFile("o3fs://bucket.volume.ozone/wordcount.txt")
>>> wordcount=data.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
>>> wordcount.saveAsTextFile("o3fs://bucket.volume.ozone/output")

$ hdfs dfs -ls o3fs://bucket.volume.ozone/output/
Found 3 items
-rw-rw-rw-   3 learn learn          0 2023-03-02 10:55 o3fs://bucket.volume.ozone/output/_SUCCESS
-rw-rw-rw-   3 learn learn        120 2023-03-02 10:55 o3fs://bucket.volume.ozone/output/part-00000
-rw-rw-rw-   3 learn learn        141 2023-03-02 10:55 o3fs://bucket.volume.ozone/output/part-00001

$ hdfs dfs -cat o3fs://bucket.volume.ozone/output/part-00000
23/03/02 10:59:31 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-xceiverclientmetrics.properties,hadoop-metrics2.properties
23/03/02 10:59:31 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
23/03/02 10:59:31 INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
(u'a', 1)
(u'and', 1)
(u'from', 1)
(u'able', 1)
(u'is', 1)
(u'share', 1)
(u'experience', 1)
(u'Share', 1)
(u'which', 1)

In the above example, we computed the word count from the given text file and wrote the output back to the Ozone filesystem using the saveAsTextFile method, which writes the RDD to the given Ozone path.
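
The same o3fs paths also work with the DataFrame API, since Spark treats Ozone like any other Hadoop-compatible filesystem. A minimal sketch (the output path is illustrative; "value" is the default column name produced by spark.read.text):

# Read the text file from Ozone as a DataFrame and write it back as CSV.
df = spark.read.text("o3fs://bucket.volume.ozone/wordcount.txt")
df.show(truncate=False)
df.write.mode("overwrite").csv("o3fs://bucket.volume.ozone/df_output")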

Conclusion

In this article, we discussed how to run a Spark job with the Ozone filesystem. We created an Ozone volume and bucket, configured the o3fs filesystem for Spark, uploaded data with the Ozone shell, and finally read and wrote data on Ozone from PySpark. With this knowledge, you can leverage Spark and Ozone together to process big data efficiently.

For more information about the various Ozone command-line tools and the Ozone shell, refer to the Apache Ozone documentation.

Good luck with your learning!
