Description
Backend
VL (Velox)
Bug description
The total time across all tasks was 78.1 hours with vanilla Spark, but 1899.3 hours with Gluten. The flame graph shows that most of the time is spent merging payloads. After adding some logs, I saw that the merge operation ran 1,522,330 times for 1,084,128 rows in a single task, with each merge taking a few milliseconds.
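To put the numbers above in proportion, here is a quick back-of-the-envelope sketch. The first four figures are copied from this report. The per-merge cost passed to `estimated_merge_minutes` is a hypothetical value chosen only for illustration, since the logs only show "a few milliseconds" per merge.

```python
# Figures copied from the bug description.
vanilla_hours = 78.1
gluten_hours = 1899.3
merge_calls = 1_522_330   # merge operations logged in one task
rows = 1_084_128          # rows processed by that task

slowdown = gluten_hours / vanilla_hours
merges_per_row = merge_calls / rows


def estimated_merge_minutes(per_merge_ms: float) -> float:
    """Minutes one task would spend merging at an ASSUMED per-merge cost."""
    return merge_calls * per_merge_ms / 1000 / 60


print(f"slowdown: {slowdown:.1f}x")
print(f"merges per row: {merges_per_row:.2f}")
# 2.0 ms is a hypothetical value, not a measurement from this report.
print(f"merge time at 2 ms each: {estimated_merge_minutes(2.0):.1f} min")
```

In other words, the task performs more merges than it has rows, so even a per-merge cost of a few milliseconds adds up to tens of minutes per task.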
Gluten shuffle metrics:
shuffle records written: 39,407,858,231
shuffle write time total (min, med, max (stageId: taskId))
17.86 h (0 ms, 27.4 s, 17.1 m (stage 0.0: task 499))
time to compress total (min, med, max (stageId: taskId))
31.52 h (0 ms, 29.6 s, 20.3 m (stage 0.0: task 1191))
time to split total (min, med, max (stageId: taskId))
1478.09 h (0 ms, 45.5 m, 1.71 h (stage 0.0: task 86))
time to spill total (min, med, max (stageId: taskId))
15.77 h (0 ms, 25.2 s, 16.3 m (stage 0.0: task 499))
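As a rough sanity check on the metrics above, the sketch below computes each metric's share of the 1899.3 hours of total Gluten task time from the bug description. It assumes the total task time and the shuffle metrics cover the same stage, and that the metrics may overlap (for example, spill time may include compression), so the shares are indicative only.

```python
# Totals copied from the shuffle metrics and the bug description.
total_task_hours = 1899.3
metric_hours = {
    "shuffle write": 17.86,
    "compress": 31.52,
    "split": 1478.09,
    "spill": 15.77,
}

# Share of total task time per metric (metrics may overlap; see the note above).
shares = {name: hours / total_task_hours for name, hours in metric_hours.items()}

# Print the largest contributor first.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>13}: {share:6.1%}")
```

Under those assumptions, split time alone accounts for roughly three quarters of the total task time, which is consistent with the flame graph pointing at payload merging.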
shuffle schema:

Gluten version
Gluten-1.3
Spark version
Spark-3.5.x
Spark configurations
No response
System information
No response