In Spark, transformations on DataFrames are lazy, meaning they are not executed until an action (like count, collect, etc.) is called. If you perform multiple actions on the same DataFrame without caching or persisting it, Spark will recompute the entire lineage of transformations for each action. By caching or persisting the DataFrame, you store the result of the transformations, avoiding the need to recompute them each time.
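As a minimal sketch of this behavior (assuming a hypothetical "events.csv" file with a status column), both actions below re-run the full lineage because nothing is cached:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations only: nothing is executed yet.
df = spark.read.csv("events.csv", header=True)
active = df.filter(df["status"] == "active")

# Each action triggers a full run of the lineage:
# the CSV is read and the filter applied twice.
active.count()
active.collect()
```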
For this reason, DataFrames that are reused across multiple functions or operations should be cached using the .cache() method. This practice prevents unnecessary recomputations, which can be resource-intensive and time-consuming. By caching DataFrames, you can leverage Spark’s in-memory computation capabilities to improve performance, and you reduce the need to read data from the original source repeatedly.
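Continuing the sketch above, adding .cache() makes the first action materialize the result so later actions reuse it:

```python
# The first action populates the cache; the second is served from the
# cached partitions instead of re-running the lineage.
active = df.filter(df["status"] == "active").cache()

active.count()
active.collect()

active.unpersist()  # free the cached partitions once they are no longer needed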
If the DataFrame is too large to fit into memory, consider using .persist() with an appropriate storage level, such as MEMORY_AND_DISK, instead of .cache().
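For example, a sketch of .persist() with an explicit storage level (the Parquet path is hypothetical):

```python
from pyspark import StorageLevel

# MEMORY_AND_DISK spills partitions that do not fit in memory to disk
# rather than dropping them, so later actions never recompute the lineage.
large_df = spark.read.parquet("large_dataset.parquet")
large_df.persist(StorageLevel.MEMORY_AND_DISK)

large_df.count()  # materializes the persisted partitions
```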
This rule raises an issue when three or more actions are performed on a DataFrame without it being cached, or when an action is performed inside a loop.
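As a sketch of the loop case (reusing the df and status column assumed above), caching once before the loop keeps each iteration from re-running the full lineage:

```python
# Without the cache, every iteration's count() would recompute df's lineage.
df.cache()
for status in ["active", "inactive", "pending"]:
    print(status, df.filter(df["status"] == status).count())
df.unpersist()
```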