In PySpark, the dropDuplicates method is used to remove duplicate rows from a DataFrame. By default, if no column names are provided, dropDuplicates considers all columns to identify duplicates.
This default is defensive and avoids removing rows that are partially similar, but it can also lead to:
- unintended results. The simplest example is removing duplicates from a DataFrame that holds a unique id per row. It is easy to forget that the id column is part of the DataFrame, and because every id is distinct, the output DataFrame is identical to the input DataFrame (see the sketch after this list). For example, applying dropDuplicates on the following DataFrame will not remove any rows:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 2| Bob| 29|
| 3|Alice| 29|
| 4|Alice| 30|
| 5| Bob| 29|
+---+-----+---+
- performance inefficiencies. Identifying duplicates is a costly operation, as Spark has to compare every column of every row against the others, even when only a few columns actually matter.
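A minimal sketch of the pitfall above, assuming a local SparkSession and using the sample rows from the table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [(1, "Alice", 29), (2, "Bob", 29), (3, "Alice", 29), (4, "Alice", 30), (5, "Bob", 29)]
df = spark.createDataFrame(rows, ["id", "name", "age"])

# Every row carries a distinct id, so no two rows are fully identical
# and nothing is removed.
df.dropDuplicates().count()  # 5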
To ensure clarity, prevent incorrect results, and optimize performance, it is a good practice to specify the column names when using dropDuplicates.
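Continuing the sketch above, passing the relevant column names restricts the comparison and removes the duplicates; the choice of name and age here is only an illustration:

# Rows 1 and 3 share (Alice, 29) and rows 2 and 5 share (Bob, 29),
# so only three rows remain.
df.dropDuplicates(["name", "age"]).count()  # 3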
This rule will raise issues on pyspark.sql.DataFrame.dropDuplicates, pyspark.sql.DataFrame.drop_duplicates, and pyspark.sql.DataFrame.dropDuplicatesWithinWatermark.
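For reference, a minimal sketch of the other two entry points. drop_duplicates is an alias of dropDuplicates; dropDuplicatesWithinWatermark assumes Spark 3.5 or later and a streaming DataFrame, and the rate source and column names below are placeholders:

# Alias: takes the same subset argument as dropDuplicates.
df.drop_duplicates(["name", "age"])

# Streaming variant: requires a watermark on an event-time column.
# The rate source is a placeholder producing timestamp and value columns.
stream = spark.readStream.format("rate").load()
deduped = (
    stream.withWatermark("timestamp", "10 minutes")
          .dropDuplicatesWithinWatermark(["value"])
)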
Exceptions
If, however, the intent is to remove duplicates based on all columns, the distinct method can be used, or None can be passed explicitly to the subset parameter. This way the intention is clear, and this rule will not raise any issues.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Sample rows taken from the table above.
data = [(1, "Alice", 29), (2, "Bob", 29), (3, "Alice", 29), (4, "Alice", 30), (5, "Bob", 29)]
df = spark.createDataFrame(data, ["id", "name", "age"])

df_dedup = df.dropDuplicates(None)         # Compliant
df_dedup = df.dropDuplicates(subset=None)  # Compliant
df_dedup = df.distinct()                   # Compliant
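All three calls are equivalent: when subset is left as None, dropDuplicates falls back to comparing every column, which is exactly what distinct does.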