Difference between DataFrame, Dataset, and RDD in Spark

Short History of Spark:

— Spark was created at UC Berkeley (AMPLab) in 2009

— An evolution of the MapReduce concept

— Multiple tasks per executor (JVM)

— Use of memory rather than disk for intermediate result sets

Apache Spark provides 3 types of APIs:

— RDD

— DATAFRAME

— DATASET

RDD:

An RDD (Resilient Distributed Dataset) is Spark's low-level abstraction: an immutable, fault-tolerant, distributed collection of objects that can be processed in parallel.

RDDs can be created in four ways:

(1) by transforming an existing RDD

(2) from external data/file

(3) by converting a DataFrame or Dataset to an RDD

(4) from a parallelized collection

case class Person(name: String, age: Long)  // age as Long: spark.read.json later infers whole numbers as bigint

// ways (2) and (1): load an external file, then transform the resulting RDD
val rdd = sc.textFile("<path to tab-delimited file>")
  .map(line => line.split("\t"))
  .map(row => Person(row(0), row(1).toLong))
  .filter(_.age > 30)
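A minimal sketch of the remaining two ways, using made-up sample data (the file path is a placeholder):

// way (4): from a parallelized collection
val people = Seq(Person("Alice", 34), Person("Bob", 28))
val rddFromSeq = sc.parallelize(people)

// way (3): converting a DataFrame to an RDD (yields RDD[Row])
val rddFromDf = spark.read.json("<path to json file>").rdd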

DATAFRAME:

A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in an RDBMS.

val df = spark.read.json("<path to json file>")
df.select(df("name"), df("age")).filter(df("age") > 30)
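Because DataFrame rows are untyped Row objects, column references are only checked at runtime. A small sketch of two common follow-ups, assuming the JSON file has name and age fields:

df.printSchema()  // inspect the schema Spark inferred from the JSON

import org.apache.spark.sql.functions.col
df.select(col("name"), col("age")).filter(col("age") > 30).show()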

DATASET:

A Dataset is a data structure in Spark SQL that is strongly typed and maps to a relational schema. It combines the compile-time type safety of RDDs with the query optimization of DataFrames.

import org.apache.spark.sql.Dataset
import spark.implicits._  // provides the encoder for Person

val df = spark.read.json("<path to json file>")
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)
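Because the Dataset is typed, field access is checked at compile time (a typo like _.nmae fails compilation, whereas df("nmae") only fails at runtime). A short sketch on the same ds:

ds.map(_.name).show()   // Dataset[String]; _.name is compiler-checked
ds.filter(_.age > 30).show()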

To try the above examples, launch a spark-shell and execute the code line by line.

NOTE: This assumes Spark is already set up.

Good luck with your learning!
