In Spark, transformations on DataFrames are lazy, meaning they are not executed until an action (like count, collect, etc.) is called. If you perform multiple actions on the same DataFrame without caching or persisting it, Spark will recompute the entire lineage of transformations for each action. By caching or persisting the DataFrame, you store the result of the transformations, avoiding the need to recompute them each time.
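As a minimal sketch of this behavior (assuming a hypothetical "events.csv" file with a status column), both actions below re-run the full lineage because nothing is cached:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations only: nothing is executed yet.
df = spark.read.csv("events.csv", header=True)
active = df.filter(df["status"] == "active")

# Each action triggers a full run of the lineage:
# the CSV is read and the filter applied twice.
active.count()
active.collect()
```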
For this reason, DataFrames that are reused across multiple functions or operations should be cached using the .cache() method. This practice prevents unnecessary recomputations, which can be resource-intensive and time-consuming. By caching DataFrames, you can leverage Spark’s in-memory computation capabilities to improve performance, and you reduce the need to read data from the original source repeatedly.
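Continuing the sketch above, adding .cache() makes the first action materialize the result so later actions reuse it:

```python
# The first action populates the cache; the second is served from the
# cached partitions instead of re-running the lineage.
active = df.filter(df["status"] == "active").cache()

active.count()
active.collect()

active.unpersist()  # free the cached partitions once they are no longer needed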
If the DataFrame is too large to fit into memory, consider using .persist() with an appropriate storage level, such as MEMORY_AND_DISK, instead of .cache().
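For example, a sketch of .persist() with an explicit storage level (the Parquet path is hypothetical):

```python
from pyspark import StorageLevel

# MEMORY_AND_DISK spills partitions that do not fit in memory to disk
# rather than dropping them, so later actions never recompute the lineage.
large_df = spark.read.parquet("large_dataset.parquet")
large_df.persist(StorageLevel.MEMORY_AND_DISK)

large_df.count()  # materializes the persisted partitions
```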
This rule raises an issue when three or more actions are performed on a DataFrame without it being cached, or when an action is performed inside a loop.
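As a sketch of the loop case (reusing the df and status column assumed above), caching once before the loop keeps each iteration from re-running the full lineage:

```python
# Without the cache, every iteration's count() would recompute df's lineage.
df.cache()
for status in ["active", "inactive", "pending"]:
    print(status, df.filter(df["status"] == status).count())
df.unpersist()
```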