Using withColumn multiple times can lead to inefficient code, as each call creates a new Spark logical plan. withColumns allows adding or modifying multiple columns in a single operation, improving performance.
What is the potential impact?
Creating a new column can be a costly operation, as Spark has to loop over every row to compute the new column's value.
Exceptions
withColumn can be used multiple times sequentially on a DataFrame when computing consecutive columns requires the presence of the previous ones. In this case, consecutive withColumn calls are a solution.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2], [2, 3]], ["id", "value"])

# Compliant: "cubic_value" depends on "squared_value"
df_with_new_cols = (
    df.withColumn("squared_value", col("value") * col("value"))
      .withColumn("cubic_value", col("squared_value") * col("value"))
)