Media and Publications

EP Analytics' work appears in national and international peer-reviewed literature and has been highlighted in HPC newsletters.

Characterization and Bottleneck Analysis of a 64-bit ARMv8 Platform accepted to ISPASS 2016

Abstract: This paper presents the first comprehensive study of the performance, power, and energy efficiency of the AppliedMicro X-Gene, the first commercially available 64-bit ARMv8 platform. Our study includes a detailed comparison of the X-Gene to three other architectural design points common in HPC systems. Across these platforms, we perform careful measurements of 400+ workloads covering different application domains, parallelization models, floating-point precision models, memory intensities, and several other features. We find that the X-Gene consumes 1.2× less energy than an Intel Sandy Bridge, a design commonly found in HPC installations, while the Sandy Bridge is 2.3× faster.
Precisely quantifying the causes of performance and energy differences between two platforms is a challenging problem. This paper is the first to apply a statistical framework called Partial Least Squares (PLS) Path Modeling to this problem. PLS Path Modeling allows us to capture complex cause-effect relationships and difficult-to-measure performance concepts relating to the effectiveness of architectural units and subsystems in improving application performance. Using PLS Path Modeling to quantify the causes of the performance differences between X-Gene and Sandy Bridge in the HPC domain, our efforts reveal that the performance of the memory subsystem is the dominant factor.

Michael Laurenzano, Ananta Tiwari, Allyson Cauble-Chantrenne, Adam Jundt, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: ISPASS (International Symposium on Performance Analysis of Systems and Software), 2016. Available upon request.

Building Blocks for a System-wide Power and Thermal Management Framework accepted to ICPADS 2015

Abstract: Next-generation exascale systems face the difficult challenge of managing the power and thermal constraints that come from packaging more transistors into a smaller space while adding more processors to a single system. To meet this challenge, HPC center operators are looking for methodologies to save operational energy. Energy consumption in an HPC center is governed by the complex interactions between a number of different components. Without a coordinated and system-wide perspective on reducing energy consumption, isolated actions taken on one component with the intent to lower energy consumption can actually have the opposite effect on another component, thereby canceling out the net effect. For example, increasing the setpoint (or ambient temperature) to save cooling energy can lead to increased compute-node fan power and increased chip leakage power. This paper presents the building blocks required to develop and implement a system-wide framework that can take a coordinated approach to enacting thermal and power management decisions at the compute-node level (e.g., CPU speed throttling) and the infrastructure level (e.g., selecting the optimal setpoint). These building blocks consist of a suite of models that capture the thermal and power footprint of different computations and the relationships between computational properties and datacenter operating conditions.

Ananta Tiwari, Adam Jundt, William A. Ward, Jr.†, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: ICPADS (International Conference on Parallel and Distributed Systems), 2015. Available upon request.

Compute Bottlenecks on the New 64-bit ARM accepted to E2SC 2015

Abstract: The trifecta of power, performance and programmability has spurred significant interest in the 64-bit ARMv8 platform. These new systems provide energy efficiency, a traditional CPU programming model, and the potential of high performance when enough cores are thrown at the problem. However, it remains unclear how well the ARM architecture will work as a design point for the High Performance Computing market. In this paper, we characterize and investigate the key architectural factors that impact power and performance on a current ARMv8 offering (X-Gene 1) and Intel’s Sandy Bridge processor. Using Principal Component Analysis, multiple linear regression models, and variable importance analysis we conclude that the CPU frontend has the biggest impact on performance on both the X-Gene and Sandy Bridge processors.

Adam Jundt, Allyson Cauble-Chantrenne, Ananta Tiwari, Joshua Peraza, Michael Laurenzano, and Laura Carrington

Accepted to: E2SC (Energy Efficient Supercomputing), 2015. Available upon request.
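As a rough illustration of the statistical approach this abstract describes, the pipeline of dimension reduction followed by regression and variable-importance ranking can be sketched as below. This is a hypothetical example, not code from the paper: the counter data is synthetic, and PCA is computed directly via SVD rather than with the tooling the authors used.

```python
import numpy as np

# Synthetic stand-in for hardware-counter measurements: 200 runs x 6 counters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Invented "runtime" driven mostly by counters 0 and 2, plus noise.
runtime = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Step 1: PCA via SVD on the standardized feature matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
k = 3                          # keep the top-3 principal components
scores = Xs @ Vt[:k].T         # data projected onto those components

# Step 2: multiple linear regression of runtime on the components.
A = np.column_stack([np.ones(len(scores)), scores])
coef, *_ = np.linalg.lstsq(A, runtime, rcond=None)

# Step 3: variable importance, ranking components by |coefficient|.
importance = np.argsort(-np.abs(coef[1:]))
print("component ranking by importance:", importance)
```

In the paper, an analysis of this general shape (with real counter data and a proper variable-importance method) is what points to the CPU frontend as the dominant performance factor.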

Performance and Energy Efficiency Analysis of 64-bit ARM Using GAMESS accepted to Co-HPC 2015

Abstract: Power efficiency is one of the key challenges facing the HPC co-design community, sparking interest in the ARM processor architecture as a low-power high-efficiency alternative to the high-powered systems that dominate today. Recent advances in the ARM architecture, including the introduction of 64-bit support, have only fueled more interest in ARM. While ARM-based clusters have proven to be useful for data server applications, their viability for HPC applications requires an in-depth analysis of on-node and inter-node performance. To that end, as a co-design exercise, the viability of a commercially available 64-bit ARM cluster is investigated in terms of performance and energy efficiency with the widely used quantum chemistry package GAMESS. The performance and energy efficiency metrics are also compared to a conventional x86 Intel Ivy Bridge system. A 2:1 Moonshot core to Ivy Bridge core performance ratio is observed for the GAMESS calculation types considered. Doubling the number of cores to complete the execution faster on the 64-bit ARM cluster leads to better energy efficiency compared to the Ivy Bridge system; i.e., a 32-core execution of GAMESS calculation has approximately the same performance and better energy-to-solution than a 16-core execution of the same calculation on the Ivy Bridge system.

Ananta Tiwari, Kristopher Keipert, Adam Jundt, Joshua Peraza, Sarom S. Leang, Michael Laurenzano, Mark Gordon, and Laura Carrington

Accepted to: Co-HPC (International Workshop on Hardware-Software Co-Design for High Performance Computing), 2015. Available upon request.

Optimizing Codes on the Xeon Phi: A Case-study with LAMMPS accepted to XSEDE 2015

Abstract: Intel’s Xeon Phi co-processor has the potential to provide an impressive 4 GFlops/Watt while promising users that they need only recompile their code to get it to run on the accelerator. This paper reports our experience running LAMMPS, a widely-used molecular dynamics code, on the Xeon Phi and the steps we took to optimize its performance on the device. Using performance analysis tools to pinpoint bottlenecks in the code, we achieved a 2.8x speedup for the optimized code on the Xeon Phi relative to the original code on the host processors. These optimizations also improved LAMMPS performance on the host, speeding up execution by 7x.

Adam Jundt, Ananta Tiwari, William Ward, Jr.†, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: XSEDE, 2015. Available upon request.

VecMeter: Measuring Vectorization on the Xeon Phi accepted to IEEE Cluster 2015

Abstract: Wide vector units in Intel’s Xeon Phi accelerator cards can significantly boost application performance when used effectively. However, there is a lack of performance tools that provide programmers accurate information about the level of vectorization in their codes. This paper presents VecMeter, an easy-to-use tool to measure vectorization on the Xeon Phi. VecMeter utilizes binary instrumentation so no source code modifications are necessary. This paper presents design details of VecMeter, demonstrates its accuracy, defines a metric for quantifying vectorization, and provides an example where the tool can guide optimization of some code sections to improve performance by up to 33%.

Joshua Peraza, Ananta Tiwari, William Ward, Jr.†, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: IEEE Cluster, 2015. Available upon request.
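The abstract mentions a metric for quantifying vectorization. One plausible form of such a metric (this is an invented illustration, not VecMeter's actual definition) is the fraction of floating-point operations executed by packed SIMD instructions rather than scalar ones:

```python
# Hypothetical sketch of a vectorization metric: vector FLOPs as a
# fraction of all FLOPs. A real tool like VecMeter would gather these
# counts via binary instrumentation; the numbers below are invented.
def vectorization_ratio(vector_flops, scalar_flops):
    """Return vector FLOPs as a fraction of all FLOPs (0.0 to 1.0)."""
    total = vector_flops + scalar_flops
    return vector_flops / total if total else 0.0

# Example: a loop where 8-wide SIMD covers most, but not all, iterations.
print(vectorization_ratio(vector_flops=896, scalar_flops=104))  # 0.896
```

A ratio well below 1.0 on a hot loop is the kind of signal that would direct a programmer toward the code sections worth restructuring for the Phi's wide vector units.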

Power-performance Models for Runtime Reconfiguration and Power Capping accepted to MODSIM 2015

Abstract: In future power-constrained systems, power capping and power shifting techniques will have to guarantee that a system operates within a given power envelope. However, it is only with integrated power and performance models that a controlling agent can optimize power draw and computing throughput of a system. In addition, these underlying workload-specific models must be simple and fast enough to support dynamic reconfigurations. We discuss our research and two modeling approaches that are currently under development: fine-grained instruction-level and coarse-grained statistical power/performance models. These models, which take application characteristics into account, are suitable for the design of proactive policies to steer systems towards maximizing compute throughput within a given power budget.
Instruction-level models are built by benchmarking assembly instructions in isolation, measuring power draw and latencies, and building an analytical model for both power and performance. These models can be queried at run-time by combining information from static code analysis and a few hardware performance counters.
The statistical modeling approach uses hardware performance counters and applies a combination of dimension reduction analysis and linear regression techniques to correlate power draw with workloads’ characteristics. This model can be queried at run-time by using a combination of hardware performance counters.

Pietro Cicotti, Ananta Tiwari, and Laura Carrington

Accepted to: MODSIM (Workshop on Modeling & Simulation of Systems and Applications. Workshop sponsored by the U.S. Department of Energy, Office of Advanced Scientific Computing Research.), 2015. Available upon request.

EP Analytics' paper on making the most out of SMT in HPC accepted to TACO and presented at HiPEAC in Amsterdam.

Abstract: This work presents an end-to-end methodology for quantifying the performance and power benefits of Simultaneous Multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s value system-wide depends on whether users effectively employ SMT at the application level. However, predicting SMT’s benefit for HPC applications is challenging; by doubling the number of threads, the application’s characteristics may change. This work proposes statistical modeling techniques to predict the speedup SMT confers to HPC applications. This approach, accurate to within 8%, uses only lightweight, transparent performance monitors collected during a single run of the application.

Leo Porter, Michael Laurenzano, Ananta Tiwari, Adam Jundt, William Ward, Jr.†, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: TACO (ACM Transactions on Architecture and Code Optimization), 2015. Available upon request.

EP Analytics' research on adaptive model-driven facility-wide management of energy efficiency and reliability has been accepted for publication at MODSIM, August 2014.

Abstract: We present the blueprint for the Energy Efficiency Management Platform (E2MP), a power-aware, green computing technology that can enact performance-neutral power and reliability management policies on high performance computing (HPC) centers. E2MP’s design allows it to take a system-wide, holistic view of power and reliability management and dynamically make fine-grain power and thermal adaptations at the compute node level and at the facility-level in response to the behavior of the applications running in the facility.
E2MP continuously monitors a number of important metrics, including chip temperature, instantaneous per-component power draw and ambient room temperature, then relates those metrics via predictive models to specific application software behavior (e.g., quantity of main-memory traffic) and uses those relationships to steer systems towards better energy efficiency and reliability.

Ananta Tiwari, Michael Laurenzano, Adam Jundt, William Ward, Jr.†, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: MODSIM (Workshop on Modeling & Simulation of Systems and Applications. Workshop sponsored by the U.S. Department of Energy, Office of Advanced Scientific Computing Research.), 2014. Available upon request.

Research on adaptive DVFS and clock modulation for energy efficiency accepted for publication at CLUSTER Computing Conference (Madrid, Spain) in September 2014.

Abstract: Meeting the 20MW power envelope sought for exascale is one of the greatest challenges in designing that class of systems. Addressing this challenge requires an over-provisioned and dynamically reconfigurable system with fine-grained control over the power and speed of the individual cores. In this paper, we present EfficientSpeed (ES), a library that improves energy efficiency in scientific computing by carefully selecting the speed of the processor. The run-time component of ES adjusts the speed of the processor (via DVFS and clock modulation) dynamically while preserving the desired level of performance. These adjustments are based on online performance and energy measurements, user-selected policies that dictate the aggressiveness of adjustments, and user-defined performance requirements. Our results quantify the best energy savings that can be achieved by controlling the speed of the processor, with today’s technology, at the cost of negligible performance degradation. We then demonstrate that ES is effective in automatically calibrating the speed of execution in real applications, saving energy and meeting the desired performance goal. We evaluate ES on GAMESS, an ab initio quantum chemistry package. We show that ES respects the stipulated 5% performance loss bound and achieves a 16% decrease in energy required to complete the execution while running with a power draw that is 18% lower.

Pietro Cicotti, Ananta Tiwari, and Laura Carrington

Accepted to: IEEE Cluster Computing Conference, 2014. Available upon request.
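The core feedback idea behind a runtime like the one this abstract describes can be sketched as follows. This is a simplified, hypothetical illustration (the frequency table, measurement interface, and policy are invented, not EfficientSpeed's actual design): throttle the processor step by step, and back off as soon as measured slowdown exceeds the user's performance-loss bound.

```python
# Hypothetical P-state table, fastest first (values are illustrative).
FREQS_GHZ = [2.6, 2.3, 2.0, 1.7, 1.4]

def pick_frequency(slowdown_at, max_loss=0.05):
    """Return the lowest frequency whose measured slowdown stays in bound.

    slowdown_at: maps a frequency to the observed fractional slowdown
    relative to the fastest frequency (0.0 = no slowdown).
    max_loss: user-defined performance-loss bound (e.g., 5%).
    """
    chosen = FREQS_GHZ[0]
    for f in FREQS_GHZ:
        if slowdown_at(f) <= max_loss:
            chosen = f          # still within the performance bound
        else:
            break               # further throttling would violate it
    return chosen

# Synthetic measurements: a memory-bound phase tolerates throttling well.
measured = {2.6: 0.00, 2.3: 0.01, 2.0: 0.03, 1.7: 0.06, 1.4: 0.12}
print(pick_frequency(measured.get))  # 2.0
```

A real implementation would take these measurements online during execution and would also weigh clock modulation and per-policy aggressiveness, as the abstract notes.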

Our work on characterizing the performance-energy tradeoff of low-power ARM processors in HPC will appear at Euro-Par (Porto, Portugal) in June 2014.

Abstract: Deploying large numbers of small, low power cores has been gaining traction recently as a design strategy in high performance computing (HPC). The ARM platform that dominates the embedded and mobile computing segments is now being considered as an alternative to the high-end x86 processors that largely dominate HPC because peak performance per watt may be substantially improved using off-the-shelf commodity processors. In this work we methodically characterize the performance and energy of HPC computations drawn from a number of problem domains on current ARM and x86 processors. Unsurprisingly, we find that the performance, energy and energy-delay product of applications running on these platforms varies significantly across problem types and inputs. Using static program analysis we further show that this variation can be explained largely in terms of the capabilities of two processor subsystems: floating point/SIMD and the cache/memory hierarchy, and that static analysis of this kind is sufficient to predict which platform is best for a particular application/input pair. In the context of these findings, we evaluate how some of the key architectural changes being made for upcoming 64-bit ARM platforms may impact HPC application performance.

Michael Laurenzano, Ananta Tiwari, Adam Jundt, Joshua Peraza, Laura Carrington, William Ward, Jr.†, and Roy Campbell†
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Accepted to: Euro-Par, 2014. Available upon request.
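The energy-delay product (EDP) mentioned in the abstract is a standard figure of merit that penalizes slow-but-frugal designs: energy multiplied by runtime. The worked example below uses invented power and runtime numbers purely to show how the comparison plays out; it is not data from the paper.

```python
# Energy-delay product: energy (J) times delay (s). Lower is better.
def edp(power_watts, runtime_s):
    """EDP = (power * time) * time = energy * delay."""
    energy_j = power_watts * runtime_s
    return energy_j * runtime_s

# Illustrative (made-up) numbers for one workload:
arm_edp = edp(power_watts=15.0, runtime_s=40.0)   # slower, low power
x86_edp = edp(power_watts=95.0, runtime_s=10.0)   # faster, high power
print(arm_edp, x86_edp)  # 24000.0 9500.0 -- here the x86 wins on EDP
```

Because EDP squares the runtime, a large speed advantage can outweigh a large power disadvantage, which is one reason the paper finds the winner varies by problem type and input.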

EP Analytics' co-founder and vice president of research, Laura Carrington, was interviewed for an article on upcoming 64-bit ARM servers

Article Highlight: "We are dedicated to helping our government and private sector clients maximize their return on investment in their high-performance computing systems... We have utilized Dell’s Copper server to conduct testing of high performance computing workloads on a 32-bit ARM-based system versus x86 systems. It is early days, but we are encouraged by the results so far and see significant potential for 64-bit ARM servers as an energy-efficient solution to high performance computing and big data applications." The full article can be found on Dell's Community Blog.

Work done on understanding the performance of stencil computations on Intel's Xeon Phi published in CLUSTER Computing Conference

Abstract: Accelerators are becoming prevalent in high performance computing as a way of achieving increased computational capacity within a smaller power budget. Effectively utilizing the raw compute capacity made available by these systems, however, remains a challenge because it can require a substantial investment of programmer time to port and optimize code to effectively use novel accelerator hardware. In this paper we present a methodology for isolating and modeling the performance of common performance-critical patterns of code (so-called idioms) and other relevant behavioral characteristics from large-scale HPC applications that are likely to perform favorably on Intel Xeon Phi. The benefits of the methodology are twofold: (1) it directs programmer efforts toward the regions of code most likely to benefit from porting to the Xeon Phi and (2) it provides speedup estimates for porting those regions of code. We then apply the methodology to the stencil idiom, showing performance improvements of up to a factor of 4.7x on stencil-based benchmark codes.

Joshua Peraza, Ananta Tiwari, Michael Laurenzano, Laura Carrington, William Ward, Jr.†, and Roy Campbell†
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Published in: IEEE International Conference on Cluster Computing (CLUSTER), 2013. Available at IEEE

EP Analytics' research into viewing application/machine interactions through computational idioms published in MODSIM

Abstract: Models of application behavior are one of the keys to bridging the gap between current large-scale system design practices and upcoming exascale system designs. Processor/accelerator specialization and heterogeneity have been proposed as possible paths forward for attaining the significant energy efficiency improvements necessary to achieve exascale-level computing capabilities within an acceptable power envelope. To have an impact on the exascale system design process, the models must be (1) abstract, containing information that is relevant and actionable across a wide range of programming and execution models and (2) complementary to a well-defined and standardized machine characterization methodology.
We argue that a key component of this modeling paradigm is what we term an idiom, a small computational or memory access pattern. We hypothesize that much of the computational work within HPC can be expressed as the combination of a reasonably small number of basis idioms. Understanding application composition and machine characteristics in terms of how they behave in the presence of (combinations of) this small number of idioms allows us to bridge the gap between large workloads and an increasingly diverse and complex landscape of hardware options.

Michael Laurenzano, Laura Carrington, Ananta Tiwari, Joshua Peraza, William Ward, Jr.†, and Roy Campbell†
†High Performance Computing Modernization Program, U.S. Dept. of Defense

Published in: MODSIM (Workshop on Modeling & Simulation of Systems and Applications. Workshop sponsored by the U.S. Department of Energy, Office of Advanced Scientific Computing Research.), 2013. Available upon request.

Want to know more about our services and expertise? Contact Us Today