Title: Analyzing Recursive Task Parallel Programs
Speaker: Suyash Gupta (IITM)
Details: Thu, 16 Oct, 2014 2:00 PM @ BSB 361
Abstract: Extracting useful parallelism from the ideal (maximal) parallelism expressed in recursive task parallel (RTP) programs is quite challenging. In this work, we present two symbiotic optimizations that reduce the task creation and termination operations in RTP programs. Our first optimization, Dynamic Load-Balanced loop Chunking (DLBC), extends prior work on loop chunking to decide the number of parallel tasks at runtime, based on the number of available worker threads. The second optimization, Aggressive Finish-Elimination (AFE), significantly reduces dynamic redundant join operations. Further, we discuss the impact of exceptions on our optimizations and extend them to handle RTP programs that may throw exceptions. We present DCAFE (an extension to the X10 v2.3 compiler) that performs AFE followed by DLBC.
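To make the two ideas concrete, here is a minimal Java sketch (not from the talk, which targets X10's async/finish constructs): a parallel loop that, in the spirit of DLBC, forks roughly one chunk per worker thread decided at runtime, and, in the spirit of AFE, joins all chunks at a single enclosing point rather than once per iteration. The `parallelFor` helper and its structure are illustrative assumptions, not the DCAFE implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.IntConsumer;

public class ChunkedLoop {
    // DLBC-style chunking (sketch): split an n-iteration loop into roughly
    // one chunk per available worker thread, decided at runtime, instead of
    // forking one task per iteration.
    static void parallelFor(ForkJoinPool pool, int n, IntConsumer body) {
        int workers = pool.getParallelism();               // available workers
        int chunk = Math.max(1, (n + workers - 1) / workers);
        pool.invoke(new RecursiveAction() {
            @Override protected void compute() {
                List<RecursiveAction> tasks = new ArrayList<>();
                for (int start = 0; start < n; start += chunk) {
                    final int lo = start, hi = Math.min(n, start + chunk);
                    RecursiveAction t = new RecursiveAction() {
                        @Override protected void compute() {
                            for (int i = lo; i < hi; i++) body.accept(i); // sequentialized chunk
                        }
                    };
                    tasks.add(t);
                    t.fork();                              // one task per chunk
                }
                // AFE-style single join point (sketch): one join per chunk at
                // the enclosing task, rather than a join per iteration.
                for (RecursiveAction t : tasks) t.join();
            }
        });
    }

    public static void main(String[] args) {
        AtomicLong total = new AtomicLong();
        parallelFor(new ForkJoinPool(), 1000, total::addAndGet);
        System.out.println(total.get()); // sum of 0..999 = 499500
    }
}
```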
To test the effectiveness of DCAFE we present a new benchmark suite, IMSuite: the IIT Madras benchmark suite for simulating distributed algorithms. IMSuite consists of twelve classical algorithms representative of real-world parallel and distributed applications. The common denominator of these real-world applications is their underlying algorithms; the IMSuite algorithms act as a representative subset. We present multiple variations of our kernels, broadly categorized under two heads: (a) varying synchronization primitives (with and without fine-grain synchronization primitives); and (b) varying forms of parallelization (data parallel and recursive task parallel). Our characterization covers interesting aspects of parallel and distributed applications, such as the distribution of remote communication requests and the number of synchronization, task creation, task termination, and atomic operations. We study the behavior (execution time) of our kernels by varying the problem size, the number of compute threads, and the input configurations. We also present an involved set of input generators and output validators.
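The two parallelization styles the kernel variants vary can be illustrated with a hypothetical Java example (again an analogue of X10, not an IMSuite kernel): the same reduction written data-parallel, where the runtime partitions the index range, and recursive task parallel, where each level forks a child task and joins it.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.IntStream;

public class TwoStyles {
    // Data-parallel style: the runtime partitions the iteration space.
    static long dataParallelSum(int n) {
        return IntStream.range(0, n).parallel().asLongStream().sum();
    }

    // Recursive task-parallel style: divide-and-conquer with explicit
    // task creation (fork) and termination (join) at every level.
    static class SumTask extends RecursiveTask<Long> {
        final int lo, hi;
        SumTask(int lo, int hi) { this.lo = lo; this.hi = hi; }
        @Override protected Long compute() {
            if (hi - lo <= 64) {                  // sequential base case
                long s = 0;
                for (int i = lo; i < hi; i++) s += i;
                return s;
            }
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(lo, mid);
            left.fork();                           // task creation
            long right = new SumTask(mid, hi).compute();
            return left.join() + right;            // task termination (join)
        }
    }

    public static void main(String[] args) {
        int n = 100_000;
        long expected = (long) n * (n - 1) / 2;
        long a = dataParallelSum(n);
        long b = new ForkJoinPool().invoke(new SumTask(0, n));
        System.out.println(a == expected && b == expected); // true
    }
}
```

The recursive variant creates and joins many more tasks than the data-parallel one for the same work, which is exactly the overhead that optimizations like DLBC and AFE target.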
We tested DCAFE over a list of benchmark kernels on two different hardware systems (a 16-core Intel system and a 64-core AMD system). With respect to the base X10 compiler extended with prior work on loop chunking (LC), DCAFE achieved geometric mean speedups of 5.75× and 4.16× on the Intel and AMD systems, respectively. We also present an evaluation of energy consumption on the Intel system and show that, on average, the DCAFE versions consume 71.2% less energy than the LC versions.