A list of columns can be passed to the subset argument of PySpark's DataFrame.dropDuplicates method. The method then considers only the columns in subset when deciding whether a row is a duplicate. To compare on all columns of the DataFrame, pass None to subset or omit the argument (None is the default value).

However, when an empty list is passed to subset, dropDuplicates does not perform any meaningful deduplication: it removes all rows except one, which can lead to unexpected results and potentially incorrect data analysis. This rule ensures that DataFrame.dropDuplicates is used correctly, either by specifying at least one column or by not specifying any columns at all.
This rule also raises issues on DataFrame.drop_duplicates and DataFrame.dropDuplicatesWithinWatermark.
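
The effect of the subset argument described above can be modelled in plain Python. This is only a sketch of the semantics as this rule describes them, not Spark's implementation: drop_duplicates_model and the sample rows are hypothetical names introduced for illustration, and the model keeps the first row seen per key, whereas Spark gives no guarantee about which row survives.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def drop_duplicates_model(rows: List[Row], subset: Optional[List[str]] = None) -> List[Row]:
    """Toy model of subset-based deduplication (hypothetical helper, not Spark code)."""
    if not rows:
        return []
    # subset=None means "compare rows on every column".
    columns = list(rows[0].keys()) if subset is None else subset
    seen = set()
    kept = []
    for row in rows:
        # With an empty column list the key is always (), so every row
        # after the first one is treated as a duplicate.
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample data.
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Alice", "age": 30},
    {"name": "Alice", "age": 41},
    {"name": "Bob", "age": 30},
]

all_columns = drop_duplicates_model(rows)         # compares on name and age
by_name = drop_duplicates_model(rows, ["name"])   # compares on name only
empty_subset = drop_duplicates_model(rows, [])    # every row shares the empty key

print(len(all_columns), len(by_name), len(empty_subset))
```

In this model the empty list collapses the data to a single row because every row produces the same empty key. In PySpark code, the compliant forms are a call with at least one column, such as df.dropDuplicates(["name"]), or a call with no subset at all, such as df.dropDuplicates().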