Difference between DataFrame, Dataset, and RDD in Spark
Short History of Spark:
— Spark was created at UC Berkeley back in 2009
— An evolution of the MapReduce concepts
— Multiple tasks per executor (JVM)
— Use of memory rather than disk for intermediate result sets
Apache Spark provides 3 types of APIs:
— RDD
— DATAFRAME
— DATASET
RDD:
An RDD (Resilient Distributed Dataset) is Spark's low-level abstraction: an immutable, partitioned collection of records that can be processed in parallel.
RDDs can be created in four ways:
(1) by transforming an existing RDD
(2) from external data/file
(3) by converting a DataFrame or Dataset to an RDD
(4) from a parallelized collection
case class Person(name: String, age: Int)

val rdd = sc.textFile("<csv file path>")      // (2) from an external file
  .map(line => line.split(","))               // split each comma-separated line
  .map(row => Person(row(0), row(1).toInt))   // build a Person per row
  .filter(_.age > 30)                         // (1) transform an existing RDD
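The snippet above covers ways (1) and (2). For completeness, a minimal sketch of the other two creation ways, assuming a running spark-shell (so sc and spark are already in scope):

// (4) from a parallelized collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// (3) by converting a DataFrame (or Dataset) to an RDD via .rdd
val idsDf = spark.range(5).toDF("id")
val idsRdd = idsDf.rdd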
DATAFRAME:
A DataFrame is a distributed collection of data organized into named columns, conceptually the same as a table in an RDBMS.
val df = spark.read.json("<path to json file>")
df.select(df("name"), df("age")).filter(df("age") > 30)
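Because a DataFrame behaves like a table, it can also be queried with plain SQL. A minimal sketch, assuming the same df as above with name and age fields (the view name people is an arbitrary choice):

// Register the DataFrame as a temporary view, then query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()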
DATASET:
A Dataset is a strongly typed data structure in Spark SQL that maps to a relational schema.
import org.apache.spark.sql.Dataset
import spark.implicits._   // already in scope inside spark-shell

case class Person(name: String, age: Long)   // json integers are inferred as Long

val df = spark.read.json("<path to json file>")
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)
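The payoff of the typed API is compile-time safety: a reference to a field that does not exist on Person is rejected by the compiler, while the equivalent DataFrame mistake only surfaces at runtime. A small illustration; the salary field here is hypothetical and deliberately absent from Person:

// Compiles: Person has an age field
ds.filter(_.age > 30)

// Does not compile: Person has no salary field, the compiler catches it
// ds.filter(_.salary > 30)

// The same mistake on a DataFrame fails only at runtime
// df.filter(df("salary") > 30)   // throws AnalysisException when run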
To try the above examples, launch a spark-shell and execute them line by line.
NOTE: This assumes the Spark setup is already done.
Good Luck with your Learning !!