The PySpark API offers multiple ways of performing aggregation. When performing aggregations, data is usually shuffled between partitions. This shuffling is needed to compute the result correctly, but it comes at a cost that can impact performance, because shuffling moves data over the network between Spark tasks.
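As a quick illustration (a minimal sketch, assuming a local SparkSession; the DataFrame and its column names are made up for the example), the shuffle shows up as an Exchange node in the physical plan of a grouped aggregation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical tiny DataFrame; any grouped aggregation over it needs a shuffle.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# The printed physical plan contains an Exchange (shuffle) node that moves
# data with the same key onto the same partition before the final aggregation.
df.groupBy("key").sum("value").explain()
```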
There are, however, cases where some aggregation methods are more efficient than others. For example, when RDD.groupByKey is used in conjunction with RDD.mapValues and the function passed to RDD.mapValues is commutative and associative, it is preferable to use RDD.reduceByKey instead. The performance gain from RDD.reduceByKey comes from reducing the amount of data that needs to be moved between PySpark tasks: RDD.reduceByKey combines the values for each key within a partition before sending the data over the network for further reduction. When using RDD.groupByKey with RDD.mapValues, on the other hand, the reduction is only done after the data has been moved around the cluster, which slows down the computation by transferring a larger amount of data over the network.
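A minimal sketch comparing the two approaches on a word-count style sum (the sample pairs are made up for the example; both variants produce the same result, but only RDD.reduceByKey performs the partial, per-partition reduction before the shuffle):

```python
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical sample data: (key, value) pairs spread across partitions.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)])

# Slower: groupByKey shuffles every (key, value) pair across the network,
# and the values are only summed after the shuffle.
grouped = pairs.groupByKey().mapValues(sum)

# Faster: reduceByKey sums values within each partition first, so only one
# partial sum per key per partition is sent over the network.
reduced = pairs.reduceByKey(add)

assert sorted(grouped.collect()) == sorted(reduced.collect())  # [('a', 3), ('b', 2)]
```

Note that this substitution only applies because addition is commutative and associative; if the function passed to RDD.mapValues needs to see all values for a key at once, RDD.groupByKey remains the appropriate choice.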