AWS Glue

Spark knowledge

AWS Glue is a managed ETL service that runs Spark jobs. It supports three types of jobs:

Spark batch, Spark Structured Streaming, and Python shell (a plain Python script).

The cost of each job is based on the resources used and the time consumed. The resource unit is the DPU (Data Processing Unit), which is a combination of vCPU and memory:

1 DPU = 4 vCPUs + 16 GB of memory
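
For a rough job cost, multiply the number of DPUs by the runtime to get DPU-hours, then by the per-DPU-hour rate. A minimal sketch, assuming the commonly listed on-demand price of about $0.44 per DPU-hour (check the pricing page for your region):

# Rough Glue job cost: DPU-hours x price per DPU-hour (the default price is an assumption).
def estimate_glue_cost(num_dpus, runtime_minutes, price_per_dpu_hour=0.44):
    dpu_hours = num_dpus * runtime_minutes / 60
    return dpu_hours * price_per_dpu_hour

# e.g. 10 DPUs for 30 minutes -> 5 DPU-hours -> about $2.20
print(estimate_glue_cost(10, 30))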

The DPU setting indirectly influences the executor configuration (executor cores, executor memory, and number of executors). I'm not sure about the exact relationship between DPUs and the automatically determined executor configuration, but it looks like each worker runs one executor, so the number of workers determines the number of executors.
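
One way to see what Glue actually picked is to print the resolved executor settings from inside the job. A minimal sketch, assuming it runs inside a Glue Spark job (spark.executor.instances may be absent if dynamic allocation is in effect):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Print the executor-related settings the job actually resolved
for key in ("spark.executor.instances", "spark.executor.cores", "spark.executor.memory"):
    print(key, "=", sc.getConf().get(key, "not set"))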

You can set --conf spark.executor.memory=3g --conf spark.executor.cores=1 in the job parameters, but spark.executor.instances doesn't take effect there. A possible way to make it work is to set the configuration directly in your code (this suggestion is from ChatGPT and unverified, because I don't have a Glue environment on hand), like:

from awsglue.context import GlueContext
from pyspark import SparkConf
from pyspark.context import SparkContext

# Executor settings must be applied before the SparkContext is created;
# mutating sc._conf after the context exists has no effect.
conf = SparkConf()
conf.set("spark.executor.instances", "10")  # number of Spark executors
conf.set("spark.executor.memory", "4g")     # memory per executor
conf.set("spark.executor.cores", "2")       # CPU cores per executor

sc = SparkContext.getOrCreate(conf)
glueContext = GlueContext(sc)