This webpage is no longer updated. Please visit my personal webpage.
"Although this may seem a paradox, all exact science is dominated by the idea of approximation."
"It is the mark of an instructed mind to rest assured with that degree of precision that the nature of the subject admits, and not to seek exactness when only an approximation of the truth is possible. "
My dissertation research focuses on accelerating parallel graph processing using approximate computing. I design techniques for improving the efficiency of graph analytics on Graphics Processing Units (GPUs) by trading-off computational accuracy. When the graph sizes are huge (e.g., billion-scale networks), or the underlying processing is expensive, it is not always feasible to compute the exact solution in time. A practical question to ask in such scenarios is: "given an infrastructure, how to make the best use of resources to obtain an acceptable, inexact solution in a reasonable time?" I seek to answer this question by bringing parallelism and approximate computing into a convergent trajectory to make heavy-weight graph computation practical and scalable.
Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses involved in graph traversals. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. In this work, we attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output, for obtaining the result in short order. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replica of nodes to improve coalescing. The second reduces memory latency by facilitating the processing of subgraphs inside shared memory by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each of the techniques offers notable performance benefits, and provides a knob to control the amount of inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of five large graphs with varied characteristics and five popular graph algorithms.
Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses and control-flow involved in graph traversals. In this work, we tame these challenges by injecting approximations. In particular, we improve memory coalescing by renumbering and replicating nodes, memory latency by adding edges among specific nodes brought into shared memory, and thread-divergence by normalizing degrees across nodes assigned to a warp. Using a suite of graphs with varied characteristics and five popular algorithms, we demonstrate the effectiveness of our proposed techniques. Our approximations for coalescing, memory latency and thread-divergence lead to mean speedups of 1.3×, 1.41× and 1.06× achieving accuracies of 83%, 78% and 84%, respectively.
Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing units (GPUs) are used to execute graph algorithms. While the prior art has shown effective parallelization of several graph algorithms on GPUs, a few algorithms are still expensive. In this work, we address the scalability issues in graph parallelization. In particular, we aim to improve the execution time by tolerating a little approximation in the computation. We study the effects of four heuristic approximations on six graph algorithms with five graphs and show that if an application allows for small inaccuracy, this can be leveraged to achieve considerable performance benefits. We also study the effects of the approximations on GPU-based processing and provide interesting takeaways.
Developed a standalone parallel implementation of (a part of) SixTrackLib. The contributed code is available here.
Developed SALLOC -- An arena based memory allocator for SIMT architectures.
SixTrack is a single-particle tracking code for high-energy circular accelerators routinely used at CERN for the Large Hadron Collider (LHC), its luminosity upgrade (HL-LHC), the Future Circular Collider (FCC) and the Super Proton Synchrotron (SPS) simulations. The code is based on a 6D symplectic tracking engine, which is optimized for long-term tracking simulations and delivers fully reproducible results on several platforms. It also includes multiple scattering engines for beam–matter interaction studies, as well as facilities to run the integrated simulations with external particle matter interaction codes. These features differentiate SixTrack from general-purpose, optics-design software. The code recently underwent a major restructuring to merge the advanced features into a single branch, such as multiple ion species, interface with external codes and high-performance input/output. This restructuring also removed a large number of compilation flags, instead enabling/disabling the functionality with runtime options. In the process, the code was moved from Fortran 77 to Fortran 2018 standard, also allowing and achieving a better modularization. Physics models (beam–beam effects, Radio-Frequency (RF) multipoles, current carrying wires, solenoid and electron lenses) and methods (symplecticity check) have also been reviewed and refined to offer more accurate results. The SixDesk runtime environment allows the user to manage the large batches of simulations required for accurate predictions of the dynamic aperture. SixDesk supports several cluster environments available at CERN as well as submitting jobs to the LHC@Home volunteering computing project, which enables volunteers contributing with their hardware to CERN simulation. SixTrackLib is a new library aimed at providing a portable and flexible tracking engine for single- and multi-particle problems using the models and formalism of SixTrack. The library is able to run in CPUs as well as graphical processing units (GPUs). This contribution presents the status of the code, summarizes the main existing features and provides details about the main development lines SixTrack, SixDesk and SixTrackLib.
SixTrack is a single-particle tracking code for high-energy circular accelerators routinely used at CERN for the Large Hadron Collider (LHC), its luminosity upgrade (HL-LHC), the Future Circular Collider (FCC), and the Super Proton Synchrotron (SPS) simulations. The code is based on a 6D symplectic tracking engine, which is optimised for long-term tracking simulations and delivers fully reproducible results on several platforms. It also includes multiple scattering engines for beam-matter interaction studies, as well as facilities to run integrated simulations with FLUKA and GEANT4. These features differentiate SixTrack from general-purpose, optics-design software like MAD-X. The code recently underwent a major restructuring to merge advanced features into a single branch, such as multiple ion species, interface with external codes, and high-performance input/output (XRootD, HDF5). This restructuring also removed a large number of build flags, instead enabling/disabling the functionality at run-time. In the process, the code was moved from Fortran 77 to Fortran 2018 standard, also allowing and achieving a better modularization. Physics models (beam-beam effects, RF-multipoles, current carrying wires, solenoid, and electron lenses) and methods (symplecticity check) have also been reviewed and refined to offer more accurate results. The SixDesk runtime environment allows the user to manage the large batches of simulations required for accurate predictions of the dynamic aperture. SixDesk supports CERN LSF and HTCondor batch systems, as well as the BOINC infrastructure in the framework of the LHC@Home volunteering computing project. SixTrackLib is a new library aimed at providing a portable and flexible tracking engine for single- and multi-particle problems using the models and formalism of SixTrack. The tracking routines are implemented in a parametrized C code that is specialised to run vectorized in CPUs and GPUs, by using SIMD intrinsics, OpenCL 1.2, and CUDA technologies. This contribution presents the status of the code and an outlook on future developments of SixTrack, SixDesk, and SixTrackLib.