WOSB & WBENC-Certified WBE

Company

About

EP Analytics provides expertise and software tools for application-focused performance analysis and optimization of High-Performance Computing (HPC) systems. The company assists clients in the government and private sectors in designing and procuring mission-critical HPC systems and in maximizing the effectiveness of existing systems based on specific application characteristics. EP Analytics' expertise and tools can help enterprises maximize the return on investment in HPC systems, support the analysis behind procurement RFPs, and provide actionable advice on the benefit of hardware upgrades for given workloads. The company's team has expertise in x86, Power, and ARM processor architectures, parallel computing, program inspection and analysis, hardware-software co-design, performance characterization, and other areas.

As part of the Weather and Climate Computing Infrastructure Services (WCCIS) project, our team at NOAA assists in onboarding critical weather modeling codes and ensures that production models execute within strict schedules.

EP Analytics is a woman-owned small business, certified as a Women’s Business Enterprise (WBE) and a Women-Owned Small Business (WOSB) by the Women’s Business Enterprise National Council (WBENC).

Leadership

Laura Carrington, Ph.D.

Co-founder and President

Dr. Carrington is an expert in High Performance Computing with over 50 publications in HPC benchmarking, workload analysis, application performance analysis and optimization, analysis of accelerators (e.g., FPGAs and GPUs) for scientific workloads, performance analysis tools (e.g., processor and network simulators), and energy-efficient computing. She has given numerous invited talks, serves on various panels and committees, and has been a member of the DoD HPCMP Performance team, involved in its annual HPC system procurement, for over 10 years.

Ananta Tiwari, Ph.D.

VP of Research Operations

Dr. Tiwari is an expert in performance, power, and energy modeling and in model- and empirically-driven application tuning and optimization. His research interests include developing analytical and statistical models for the energy consumption of large-scale data-intensive applications and utilizing those models as the basis for application- and energy-aware auto-tuning techniques for HPC applications. He is an AWS Certified Solutions Architect.

Sarom Leang, Ph.D.

Software Engineer | Computational Scientist

Dr. Leang obtained his Ph.D. from Iowa State University under Mark S. Gordon, where his studies focused on the development and application of the highly scalable General Atomic and Molecular Electronic Structure System (GAMESS) quantum chemistry code. He completed his post-doctoral studies at Iowa State University analyzing the performance of compute-intensive kernels in GAMESS on emerging energy-efficient computational architectures (e.g., GPGPU, MIC, and ARM). In 2014, Dr. Leang joined the Ames Laboratory staff as an assistant scientist and later became the development lead for GAMESS. As development lead, he introduced the use of container technology to drive the automation of code building, testing, and validation. At EP Analytics, Dr. Leang provides a unique “end user” perspective and continues his passion for helping domain scientists adopt new software technologies into existing scientific development workflows to enhance productivity and output.

Allyson Cauble-Chantrenne

Senior Software Engineer

Ms. Cauble-Chantrenne has been with EP Analytics since she graduated in 2014 with her M.S. in Computer Science from the University of California, San Diego. She is particularly interested in developing performance analysis tools for High Performance Computing. She has been the main software engineer for the x86 binary instrumentation (BI) tools, which she has developed, ported, optimized, and helped integrate into PerfPal. She is currently the project manager for the MemInsight project, a DoD Phase II SBIR involving analysis of data movement and thread-level performance of HPC applications.

Careers

We are always looking for talented individuals who are passionate about HPC to join our team! Contact us to learn more!

Publications

A partial list of publications is provided below. For a full list, please visit Dr. Carrington's Google Scholar page.

Recent developments in the general atomic and molecular electronic structure system

https://doi.org/10.1063/5.0005188

Characterization and Bottleneck Analysis of a 64-bit ARMv8 Platform

This paper presents the first comprehensive study of the performance, power and energy efficiency of the AppliedMicro X-Gene, the first commercially available 64-bit ARMv8 platform. Our study includes a detailed comparison of the X-Gene to three other architectural design points common in HPC systems. Across these platforms, we perform careful measurements across 400+ workloads, covering different application domains, parallelization models, floating-point precision models, memory intensities, and several other features. We find that the X-Gene has 1.2× better energy consumption than an Intel Sandy Bridge, a design commonly found in HPC installations, while the Sandy Bridge is 2.3× faster.

Precisely quantifying the causes of performance and energy differences between two platforms is a challenging problem. This paper is the first to adopt a statistical framework called Partial Least Squares (PLS) Path Modeling to this problem. PLS Path Modeling allows us to capture complex cause-effect relationships and difficult-to-measure performance concepts relating to the effectiveness of architectural units and subsystems in improving application performance. Using PLS Path Modeling to quantify the causes of the performance differences between X-Gene and Sandy Bridge in the HPC domain, our efforts reveal that the performance of the memory subsystem is the dominant factor.

10.1109/ISPASS.2016.7482072

Running large‐scale CFD applications on Intel‐KNL–based clusters

Intel's latest Xeon Phi processor, Knights Landing (KNL), has the potential to provide over 2.6 TFLOPS. However, to obtain maximum performance on the KNL, significant refactoring and optimization of application codes are still required to exploit key architectural innovations that KNL features—wide vector units, many‐core node design, and deep memory hierarchy. The experience and insights gained in porting and running FEFLO (a typical edge‐based finite element code for the solution of compressible and incompressible flows) on the KNL platform are described in this paper. In particular, optimizations used to extract on‐node parallelism via vectorization and multithreading and improve internode communication are considered. These optimizations resulted in a 2.3× performance gain on 16-node runs of FEFLO, with the potential for larger performance gains as the code is scaled beyond 16 nodes. The impact of the different configurations of KNL's on‐package MCDRAM (Multi‐Channel DRAM) memory on FEFLO's performance is also explored. Finally, the performance of the optimized versions of FEFLO for KNL and Haswell (Intel Xeon) is compared.

https://doi.org/10.1002/fld.4474
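
As a rough illustration of the kind of on-node optimization described in this abstract, the C sketch below exposes a simple loop to both OpenMP multithreading and SIMD vectorization. It is a minimal, hypothetical example, not code from FEFLO, and the kernel and array names are invented.

    #include <stddef.h>

    /* Minimal sketch: one loop spread across cores and flagged for SIMD
     * execution so the compiler can target wide vector units such as
     * AVX-512 on KNL. Compile with OpenMP enabled, e.g. gcc -O3 -fopenmp. */
    void axpy(double *restrict out, const double *restrict a,
              const double *restrict b, double alpha, size_t n)
    {
        #pragma omp parallel for simd
        for (size_t i = 0; i < n; i++) {
            out[i] = a[i] + alpha * b[i];
        }
    }

MCDRAM placement on KNL would be controlled outside the code, for example by launching the binary under numactl, which this sketch does not show.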

Performance and Energy Efficiency Analysis of 64-bit ARM Using GAMESS

Power efficiency is one of the key challenges facing the HPC co-design community, sparking interest in the ARM processor architecture as a low-power high-efficiency alternative to the high-powered systems that dominate today. Recent advances in the ARM architecture, including the introduction of 64-bit support, have only fueled more interest in ARM. While ARM-based clusters have proven to be useful for data server applications, their viability for HPC applications requires an in-depth analysis of on-node and inter-node performance. To that end, as a co-design exercise, the viability of a commercially available 64-bit ARM cluster is investigated in terms of performance and energy efficiency with the widely used quantum chemistry package GAMESS. The performance and energy efficiency metrics are also compared to a conventional x86 Intel Ivy Bridge system. A 2:1 Moonshot core to Ivy Bridge core performance ratio is observed for the GAMESS calculation types considered. Doubling the number of cores to complete the execution faster on the 64-bit ARM cluster leads to better energy efficiency compared to the Ivy Bridge system; i.e., a 32-core execution of a GAMESS calculation has approximately the same performance and better energy-to-solution than a 16-core execution of the same calculation on the Ivy Bridge system.

https://doi.org/10.1145/2834899.2834905
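
The comparison above comes down to energy-to-solution, i.e. average power multiplied by time-to-solution. The C snippet below works through that arithmetic with invented power and runtime figures (not the paper's measurements) to show how a platform that runs somewhat slower at much lower power can still consume less energy overall.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical numbers, for illustration only. */
        double arm_power_w = 60.0,  arm_time_s = 210.0;   /* 32-core ARM run */
        double x86_power_w = 130.0, x86_time_s = 200.0;   /* 16-core x86 run */

        /* Energy-to-solution (joules) = average power (W) * runtime (s). */
        printf("ARM energy-to-solution: %.0f J\n", arm_power_w * arm_time_s);
        printf("x86 energy-to-solution: %.0f J\n", x86_power_w * x86_time_s);
        return 0;
    }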

VecMeter: Measuring Vectorization on the Xeon Phi

Abstract: Wide vector units in Intel’s Xeon Phi accelerator cards can significantly boost application performance when used effectively. However, there is a lack of performance tools that provide programmers accurate information about the level of vectorization in their codes. This paper presents VecMeter, an easy-to-use tool to measure vectorization on the Xeon Phi. VecMeter utilizes binary instrumentation so no source code modifications are necessary. This paper presents design details of VecMeter, demonstrates its accuracy, defines a metric for quantifying vectorization, and provides an example where the tool can guide optimization of some code sections to improve performance by up to 33%.

10.1109/CLUSTER.2015.73
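
The exact metric is defined in the paper; as a hedged sketch of the general idea, the C code below computes a ratio-style vectorization level from per-basic-block operation counts of the kind a binary instrumentation tool could gather. The structure and the counts are assumptions, not VecMeter's implementation.

    #include <stddef.h>

    /* Hypothetical per-block counts collected by instrumentation. */
    struct block_counts {
        unsigned long long executions;  /* dynamic executions of the block */
        unsigned long long vector_ops;  /* vector FP ops per execution     */
        unsigned long long scalar_ops;  /* scalar FP ops per execution     */
    };

    /* Fraction of dynamic floating-point operations issued in vector form. */
    double vectorization_level(const struct block_counts *blocks, size_t n)
    {
        unsigned long long vec = 0, total = 0;
        for (size_t i = 0; i < n; i++) {
            vec   += blocks[i].executions * blocks[i].vector_ops;
            total += blocks[i].executions *
                     (blocks[i].vector_ops + blocks[i].scalar_ops);
        }
        return total ? (double)vec / (double)total : 0.0;
    }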

Making the Most of SMT in HPC: System- and Application-Level Perspectives

This work presents an end-to-end methodology for quantifying the performance and power benefits of simultaneous multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s value system-wide depends on whether users effectively employ SMT at the application level. However, predicting SMT’s benefit for HPC applications is challenging; by doubling the number of threads, the application’s characteristics may change. This work proposes statistical modeling techniques to predict the speedup SMT confers to HPC applications. This approach, accurate to within 8%, uses only lightweight, transparent performance monitors collected during a single run of the application.

https://doi.org/10.1145/2687651
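
The paper's model is built from lightweight hardware monitors collected during a single run. As a deliberately simplified stand-in, the C sketch below fits an ordinary least-squares line from one invented counter-derived feature to measured SMT speedups and then predicts the speedup for a new value; the feature, data points, and one-variable form are assumptions for illustration only.

    #include <stdio.h>

    /* Fit y = a + b*x by ordinary least squares. */
    static void fit_line(const double *x, const double *y, int n,
                         double *a, double *b)
    {
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *a = (sy - *b * sx) / n;
    }

    int main(void)
    {
        /* Invented training data: x = stall fraction from counters,
         * y = measured SMT speedup on applications where both are known. */
        double x[] = {0.10, 0.25, 0.40, 0.55};
        double y[] = {1.05, 1.15, 1.30, 1.40};
        double a, b;
        fit_line(x, y, 4, &a, &b);
        printf("predicted SMT speedup at stall fraction 0.30: %.2f\n",
               a + b * 0.30);
        return 0;
    }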

Characterizing the Performance-Energy Tradeoff of Small ARM Cores in HPC Computation

Deploying large numbers of small, low-power cores has been gaining traction recently as a system design strategy in high performance computing (HPC). The ARM platform that dominates the embedded and mobile computing segments is now being considered as an alternative to high-end x86 processors that largely dominate HPC because peak performance per watt may be substantially improved using off-the-shelf commodity processors. In this work we methodically characterize the performance and energy of HPC computations drawn from a number of problem domains on current ARM and x86 processors. Unsurprisingly, we find that the performance, energy and energy-delay product of applications running on these platforms varies significantly across problem types and inputs.

Using static program analysis, we further show that this variation can be explained largely in terms of the capabilities of two processor subsystems: single instruction multiple data (SIMD)/floating point and the cache/memory hierarchy; and that static analysis of this kind is sufficient to predict which platform is best for a particular application/input pair. In the context of these findings, we evaluate how some of the key architectural changes being made for upcoming 64-bit ARM platforms may impact HPC application performance.

https://rdcu.be/cdGRA
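
The finding above is that two statically observable properties, SIMD/floating-point behavior and cache/memory pressure, largely determine which platform is best for a given application/input pair. The C sketch below illustrates a threshold-style rule over such features; the feature names and thresholds are invented for illustration and are not the paper's model.

    #include <stdio.h>

    /* Hypothetical statically derived features of an application/input pair. */
    struct static_features {
        double simd_fp_fraction;  /* share of operations that are SIMD FP */
        double memory_intensity;  /* memory accesses per arithmetic op    */
    };

    /* Invented rule: codes that lean heavily on wide SIMD units or a deep
     * cache hierarchy favor the x86 part; lighter codes favor ARM on energy. */
    static const char *preferred_platform(struct static_features f)
    {
        if (f.simd_fp_fraction > 0.3 || f.memory_intensity > 0.5)
            return "x86";
        return "ARM";
    }

    int main(void)
    {
        struct static_features f = {0.45, 0.20};
        printf("preferred platform: %s\n", preferred_platform(f));
        return 0;
    }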

PEBIL: binary instrumentation for practical data-intensive program analysis

In order to achieve a high level of performance, data intensive programs such as the real-time processing of surveillance feeds from unmanned aerial vehicles, genomics sequence comparison or large graph traversal require the strategic application of multi/many-core processors and co-processors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program and system design decisions, program runtime behavior gathered through binary instrumentation is useful because it enables inspection of the low-level interactions between a data intensive program and a multi-core processor or many-core co-processor. This work details two novel mechanisms in the PEBIL binary instrumentation platform that make it well-suited for analyzing data-intensive programs by providing (1) support for fast lookup of instrumentation thread-local storage (ITLS) and (2) support for the fast enabling and disabling of instrumentation at runtime as a methodology for supporting sampling. These features are compared to two other popular binary instrumentation platforms, Pin and Dyninst, in both analytical and empirical terms for programs implemented using the popular but disparate parallelization models MPI and OpenMP. Empirical comparisons are made for two binary instrumentation applications that are critical to the analysis of data-intensive programs, basic block counting and memory address trace collection. These empirical results show that PEBIL is unrivaled in terms of overhead for basic block counting, introducing an average of 18 % extra runtime for MPI programs and 116 % for OpenMP programs as opposed to 60 % (MPI) and 232 % (OpenMP) for Pin and 20 % (MPI) and 14743 % (OpenMP) for Dyninst. For memory address trace collection that makes use of the conventional optimization of sampling 10 % of the memory addresses of a program to reduce processing time, PEBIL also introduces the lowest overheads of 144 % (MPI) and 222 % (OpenMP) compared to 313 % (MPI) and 360 % (OpenMP) with Pin and 1113 % (MPI) and 89075 % (OpenMP) with Dyninst.

https://doi.org/10.1007/s10586-013-0307-2
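
PEBIL injects its instrumentation directly into the binary, so the C sketch below is only an illustration of the two mechanisms named above: thread-local storage for per-thread counters and a cheap runtime switch that lets sampling enable and disable the probes. All identifiers here are hypothetical and are not PEBIL's API.

    #include <stdio.h>

    #define NUM_BLOCKS 4

    /* Thread-local basic-block counters: each thread increments its own
     * slot, avoiding synchronization on the hot path. */
    static _Thread_local unsigned long long bb_counts[NUM_BLOCKS];

    /* Global switch checked by every probe; when sampling turns it off,
     * the probe returns immediately and instrumentation overhead stays low. */
    static volatile int instrumentation_on = 1;

    static void bb_probe(int block_id)
    {
        if (!instrumentation_on)
            return;
        bb_counts[block_id]++;
    }

    int main(void)
    {
        /* Simulated execution: sample roughly 10% of the iterations. */
        for (unsigned long long i = 0; i < 1000; i++) {
            instrumentation_on = (i % 10 == 0);
            bb_probe(0);  /* probe that would be inserted at a basic block */
        }
        printf("sampled executions of block 0: %llu\n", bb_counts[0]);
        return 0;
    }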