Python static code analysis

Unique rules to find Bugs, Vulnerabilities, Security Hotspots, and Code Smells in your Python code

Filtered by tag data-science: 24 rules found

"master" and "appName" should be set when constructing PySpark "SparkContext"s and "SparkSession"s
Code Smell
PySpark's "RDD.groupByKey", when used in conjunction with "RDD.mapValues" with a commutative and associative operation, should be replaced by "RDD.reduceByKey"
Code Smell
PySpark's "DataFrame" column names should be unique
Code Smell
PySpark "dropDuplicates" subset argument should not be provided with an empty list
Code Smell
Complex logic provided to PySpark "withColumn", "filter" and "when" methods should be refactored into separate expressions
Code Smell
PySpark lit(None) should be used when populating empty columns
Code Smell
PySpark DataFrame toPandas function should be avoided
Code Smell
The "how" parameter should be specified when joining two PySpark DataFrames
Code Smell
"withColumns" method should be preferred over "withColumn" when multiple columns are specified
Code Smell
PySpark DataFrames used multiple times should be cached or persisted
Code Smell
PySpark Pandas DataFrame columns should not use a reserved name
Code Smell
The "subset" argument should be provided when using PySpark DataFrame "dropDuplicates" method
Code Smell
PySpark Window functions should always specify a frame
Code Smell
pandas.pipe method should be preferred over long chains of instructions
Code Smell
The "pandas.DataFrame.to_numpy()" method should be preferred to the "pandas.DataFrame.values" attribute
Code Smell
'dtype' parameter should be provided when using 'pandas.read_csv' or 'pandas.read_table'
Code Smell
When using pandas.merge or pandas.join, the parameters on, how and validate should be provided
Code Smell
inplace=True should not be used when modifying a Pandas DataFrame
Code Smell
Deprecated NumPy aliases of built-in types should not be used
Code Smell
np.nonzero should be preferred over np.where when only the condition parameter is set
Code Smell
Passing a list to np.array should be preferred over passing a generator
Code Smell
numpy.random.Generator should be preferred to numpy.random.RandomState
Code Smell
Results that depend on random number generation should be reproducible
Code Smell
Floating point numbers should not be tested for equality
Bug

PySpark "dropDuplicates" subset argument should not be provided with an empty list

consistency - conventional

reliability

Code Smell

data-science
pyspark

This rule raises an issue when an empty list is provided to PySpark DataFrame.dropDuplicates, DataFrame.drop_duplicates or DataFrame.dropDuplicatesWithinWatermark.

Why is this an issue?

The subset argument of PySpark’s DataFrame.dropDuplicates method accepts a list of columns. When a list is provided, only the columns it names are considered when deciding whether a row is a duplicate. To compare rows on all columns of the DataFrame, pass None to subset or omit the argument (None is the default value). However, when an empty list is passed to subset, dropDuplicates does not perform a meaningful deduplication: it removes all rows except one, which can lead to unexpected results and potentially incorrect data analysis. This rule ensures that DataFrame.dropDuplicates is used correctly, either by specifying at least one column or by not specifying subset at all.
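The three cases described above can be sketched with a small plain-Python model. This is not PySpark code and not how Spark implements deduplication; `drop_duplicates_model` and the sample rows are hypothetical names introduced only to show why an empty subset collapses the data: every row ends up with the same (empty) comparison key.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]


def drop_duplicates_model(
    rows: Sequence[Row], subset: Optional[List[str]] = None
) -> List[Row]:
    """Plain-Python model of the semantics described above (assumption, not PySpark)."""
    seen = set()
    kept = []
    for row in rows:
        # None means "compare on every column"; a list means "compare on these columns only".
        columns = sorted(row) if subset is None else subset
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
]

all_columns = drop_duplicates_model(rows)                # subset omitted: compare all columns
by_id = drop_duplicates_model(rows, subset=["id"])       # compare on "id" only
empty_subset = drop_duplicates_model(rows, subset=[])    # every key is (), so one row survives

print(len(all_columns), len(by_id), len(empty_subset))
```

The model keeps the first row it encounters; which row a distributed engine keeps in the empty-subset case should not be relied on.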

This rule also raises issues on DataFrame.drop_duplicates and DataFrame.dropDuplicatesWithinWatermark.
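When the column list is built dynamically, an empty list can slip through unnoticed. One possible safeguard is a small wrapper that rejects an empty subset before delegating to DataFrame.dropDuplicates. The helper name `drop_duplicates_checked` is hypothetical and not part of PySpark; the sketch only assumes that `df` exposes `dropDuplicates(subset)`, as a PySpark DataFrame does.

```python
from typing import Any, List, Optional


def drop_duplicates_checked(df: Any, subset: Optional[List[str]] = None) -> Any:
    """Hypothetical guard: refuse an empty subset, then call df.dropDuplicates."""
    if subset is not None and len(subset) == 0:
        raise ValueError(
            "subset must name at least one column, or be None to use all columns"
        )
    # subset is either None (all columns) or a non-empty list of column names.
    return df.dropDuplicates(subset)
```

The same check would apply before calling drop_duplicates or dropDuplicatesWithinWatermark.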

Available In:

In-IDE (SonarQube for IDE): catch issues on the fly, in your IDE
SaaS (SonarQube Cloud): detect issues in your GitHub, Azure DevOps Services, Bitbucket Cloud, and GitLab repositories
Self-Hosted (SonarQube Server): analyze code in your on-premise CI