In PySpark, the dropDuplicates method is used to remove duplicate rows from a DataFrame. By default, if no column names are provided, dropDuplicates considers all columns to identify duplicates.
This default is defensive and avoids removing rows that are partially similar, but it can also lead to:
- unintended results. The simplest example is removing duplicates from a DataFrame that holds a unique id per row. It is easy to forget that the id column is part of the DataFrame, and because every id is distinct, the output DataFrame is identical to the input DataFrame (see the sketch after this list). For example, applying dropDuplicates on the following DataFrame will not remove any rows:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 2| Bob| 29|
| 3|Alice| 29|
| 4|Alice| 30|
| 5| Bob| 29|
+---+-----+---+
- performance inefficiencies. Identifying duplicates is a costly operation, as Spark has to compare every column of every row against the others, even when only a few columns actually matter.
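A minimal sketch of the pitfall above, assuming a local SparkSession and using the sample rows from the table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [(1, "Alice", 29), (2, "Bob", 29), (3, "Alice", 29), (4, "Alice", 30), (5, "Bob", 29)]
df = spark.createDataFrame(rows, ["id", "name", "age"])

# Every row carries a distinct id, so no two rows are fully identical
# and nothing is removed.
df.dropDuplicates().count()  # 5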
To ensure clarity, prevent incorrect results, and optimize performance, it is a good practice to specify the column names when using dropDuplicates.
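Continuing the sketch above, passing the relevant column names restricts the comparison and removes the duplicates; the choice of name and age here is only an illustration:

# Rows 1 and 3 share (Alice, 29) and rows 2 and 5 share (Bob, 29),
# so only three rows remain.
df.dropDuplicates(["name", "age"]).count()  # 3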
This rule will raise issues on pyspark.sql.DataFrame.dropDuplicates, pyspark.sql.DataFrame.drop_duplicates, and pyspark.sql.DataFrame.dropDuplicatesWithinWatermark.
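For reference, a minimal sketch of the other two entry points. drop_duplicates is an alias of dropDuplicates; dropDuplicatesWithinWatermark assumes Spark 3.5 or later and a streaming DataFrame, and the rate source and column names below are placeholders:

# Alias: takes the same subset argument as dropDuplicates.
df.drop_duplicates(["name", "age"])

# Streaming variant: requires a watermark on an event-time column.
# The rate source is a placeholder producing timestamp and value columns.
stream = spark.readStream.format("rate").load()
deduped = (
    stream.withWatermark("timestamp", "10 minutes")
          .dropDuplicatesWithinWatermark(["value"])
)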
Exceptions
If, however, the intent is to remove duplicates based on all columns, the distinct method can be used, or None can be passed explicitly to the subset parameter. This way the intention is clear, and this rule will not raise any issues.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Sample rows taken from the table above.
data = [(1, "Alice", 29), (2, "Bob", 29), (3, "Alice", 29), (4, "Alice", 30), (5, "Bob", 29)]
df = spark.createDataFrame(data, ["id", "name", "age"])

df_dedup = df.dropDuplicates(None)         # Compliant
df_dedup = df.dropDuplicates(subset=None)  # Compliant
df_dedup = df.distinct()                   # Compliant
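All three calls are equivalent: when subset is left as None, dropDuplicates falls back to comparing every column, which is exactly what distinct does.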