Spark is a data processing framework that helps to process data faster. It uses in-memory and multiple nodes to run the task across the cluster and provide better performance. However, As we are seeing the dataset is keep growing and it will eventually reduce the performance.
Having said that to enhance spark’s performance, There are two unique features that spark offers
- Caching -> Using cache() and persist() functions, We can save the data and data lineage
- Checkpointing -> Using checkpointing() function, We can save the data, without the lineage
It is very important to choose the correct technique in the right spot to enhance the performance and will be discussing this in detail below
When you are using a dataset more than once in a single execution time, We should be caching the data in the memory, This would help to get the dataset faster in the next consecutive execution
Spark caching can happen in 3 levels
Objects are converted into compact byte streams to reduce memory consumption, But this has the drawback of requiring more processing power to deserialize the data again.
The dataset will be stored as it is in the memory, it is faster to process, but requires more memory than the serialized object (2 – 10 x more memory) than serialized objects
This will utilize the Off heap memory to avoid memory usage. But it is slower to get and process the data
- Cache or persist methods can be used with the following options.
It is the default option, It will store as deserialized Objects in the memory. But if the RDD is not fit in the memory it will not be cached
It is similar to the MEMORY_ONLY, But it will store the serialized Object. It will result in utilizing less memory But will consume more processing power for deserializing the objects
It is similar to MEMORY_ONLY, But it will serialize the RSS on the disk when the spark runs out of memory. It will be slower as the disk is slower to read
It is serialized version of MEMORY_ONLY_SER
It will store the RDD in the disk, Slower than all other options
Checkpointing helps to resolve the re-computation problem. Data Checkpointing is mandatory if the application is using any stateful operation otherwise once the application crashes all the states would be lost. Metadata checkpoints use progress information to recover from failures of the driver running the application.
Quick not on what is a stateful and stateless process
Stateless operations, such as filter(), map(), and flatMap(), do not keep data around, When passing from one event to another event
Stateful operations, such as distinct(), limit(), sorted(), reduce(), and collect(), may pass the state from the previously processed event to the processing of the next event.
- Enabling checkpointing
- Data checkpointing that saves intermediate data to reliable storage (HDFS)
- Metadata checkpointing, which stores file names and other configuration data
- Lineage will be “reset” to the point of the last checkpoint occurs
- Checkpointing limitations:
- Performed at the RDD level, not the application, level.
- There is an expense to persist to HDFS, but this can be overshadowed by the benefits based on the use case.
- No automatic cleanup of HDFS files.
- When to use Checkpointing:
- Window and other stateful streaming application transformations require it.
- Iterative applications that may loop through data many times.
Checkpointing vs Caching
- Saves a permanent copy of the intermediate data.
- Lineage is then rebuilt from the intermediate data.
- If data is lost, recompute the data from intermediate data.
- Checkpointing an RDD/data frame is similar to how Hadoop stores intermediate computation values to disk, It has a trading-off execution latency with ease of recovering from failures.
- Since an RDD/data frame is checkpointed in an external storage system, it can be reused by other applications.
- Data is stored in memory somewhere temporarily.
- Lineage is preserved.
- If data is lost, It has to recompute from base data.
- An RDD/dataframe that is cached lives within the context of the running application, and once the application terminates, cached RDDs/dataframes are deleted as well.
Use checkpointing or caching based on the use case at different points of the job. It takes longer to read and write a checkpointed RDD/data frame simply because it has to be persisted to an external storage system, but Spark worker failures need not result in a recomputation (assuming the data is intact in the external storage system).
On the other hand, cached RDD/data frame will not permanently take up storage space, but recomputation is necessary for worker failure.
Having said that, the Length of time it takes to do a computation is a good indicator to use the cache or the checkpoint. If the recomputation can be done in minutes then don’t use checkpointing.