When initializing a new SparkContext in PySpark, it is essential to specify both the master URL and the application name. The master URL determines the cluster to connect to, while the application name identifies your application on that cluster. Because creating a SparkSession can implicitly create a SparkContext, these parameters should also be set when creating a new SparkSession.
Failing to set these parameters can lead to unexpected behavior, such as connecting to an unintended cluster or having difficulty identifying your application in the Spark UI.
A good default master URL for local development is local[*], which uses all available cores on your machine. Alternatively, you can use local[n], where n is the number of cores you want to allocate. In production environments, however, you should specify the actual cluster URL (e.g., spark://host:port or yarn).
Exceptions
When using PySpark with AWS Glue, the master and name parameters are usually not set, since AWS Glue manages these configurations automatically. Because of this, the rule does not raise an issue if awsglue has been imported.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext() # Compliant: used in the context of awsglue code
glueContext = GlueContext(sc)