As the number of cores increases, Non-Uniform Memory Access (NUMA) is becoming increasingly prevalent in general-purpose machines. Effectively exploiting NUMA can significantly reduce memory access latency and thus runtime by 10-20%, and profiling provides information on how to optimise. Language-level NUMA profilers are rare, and mostly profile conventional languages executing on Virtual Machines. Here we profile, and develop new NUMA profilers for, a functional language executing on a runtime system.
We start by using existing OS-level and language-level tools to systematically profile 8 benchmarks from the GHC Haskell nofib suite on a typical NUMA server (8 regions, 64 cores). We propose a new metric, NUMA access rate, that allows us to compare the load placed on the memory system by different programs, and use it to contrast the benchmarks. We demonstrate significant differences in NUMA usage between computational and data-intensive benchmarks, e.g. local memory access rates of 23% and 30% respectively. We show that small changes to coordination behaviour can significantly alter NUMA usage, and for the first time quantify the effectiveness of the GHC 8.2 NUMA adaption.
We identify information not available from existing profilers and extend both the numaprof profiler and the GHC runtime system to obtain three new NUMA profiles: OS thread allocation locality, GC count (per region and generation), and GC thread locality. The new profiles not only provide a deeper understanding of program memory usage, but also suggest ways that GHC can be adapted to better exploit NUMA architectures.
Sun 22 Aug (displayed time zone: Seoul)
18:00 - 19:30
|Welcome to FHPNC 2021|
|Generating High Performance Code for Irregular Data Structures using Dependent Types|
|Improving GHC Haskell NUMA Profiling|