Delta table vacuum creates lots of Spark tasks

scala spark

I tried writing to a local Delta table with Spark. 10,000+ Spark tasks are created when vacuuming the table, even though it's very small, with only 3 rows under a single path. How come?
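For reference, here is a minimal sketch of what I mean, assuming a spark-shell-style session named `spark` with the delta-spark library on the classpath (the path and values are just illustrative):

```scala
import io.delta.tables.DeltaTable
import spark.implicits._

val path = "/tmp/delta-vacuum-demo" // illustrative local path

// Tiny table: 3 rows written under a single path.
Seq(1, 2, 3).toDF("id").write.format("delta").mode("overwrite").save(path)

// Vacuuming it still fans out into 10000+ tasks with default settings.
DeltaTable.forPath(spark, path).vacuum()
```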

After reading the vacuum code, I discovered that table vacuum recursively lists the files under the table path and then repartitions the result according to the spark.sql.sources.parallelPartitionDiscovery.parallelism setting. The default value of this parameter is 10000, so it should be set to a small number for local tests. After listing, there's a groupByKey step which incurs a shuffle. The shuffle's partition count is determined by spark.sql.shuffle.partitions, which defaults to 200; that's the other config to check if the problem remains after lowering the parallelism.
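A minimal sketch of lowering both knobs when building the session for a local test (the value 4 is illustrative, and the Delta extension/catalog settings are the usual wiring for delta-spark):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("delta-vacuum-local")
  // How many tasks the recursive file listing is repartitioned into (default 10000).
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "4")
  // Partitions used by the groupByKey shuffle after listing (default 200).
  .config("spark.sql.shuffle.partitions", "4")
  // Standard Delta Lake session wiring.
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()
```

With both settings reduced, the same vacuum on the 3-row table runs with only a handful of tasks.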

Note: Delta table vacuum is a time-consuming operation. Please don't call it after every write to a table; it should be a cleanup task scheduled periodically, e.g. daily.