Tutorials are scheduled for 10/07/2012.
The Performance Dimensions of PC Servers
CPU clock frequencies are no longer increasing, so other performance dimensions, such as vectors and cores, have to be exploited in an optimal way. In the introduction we define the seven performance dimensions, both inside the core and across cores and nodes. We also cover the basics of the performance monitoring subsystem inside the CPU, which must be used in order to understand the performance issues of a given software program. We explain in detail why data-oriented programming is vital for reaching good performance inside each core. We go on to define the vocabulary and the issues of multithreaded programming, and survey the multithreading software environment: pthreads, OpenMP, Threading Building Blocks and Cilk Plus. Multinode programming is also explained. Practical examples are used throughout the tutorial to reinforce the theoretical teaching of performance-oriented programming. In the final session we sum up our own experience with clusters of PC servers (including accelerators) and give the students a summary of lessons learned as well as a perspective on the foreseeable future.
Andrzej Nowak is a staff researcher at CERN openlab - a collaboration of CERN and industrial partners HP, Intel, Oracle, Siemens and Huawei. Andrzej's early research concerned operating systems security, mobile systems security, and wireless technologies. Prior to joining openlab, he worked at Intel, where he investigated custom performance optimizations of the Linux kernel and took part in developing one of the first implementations of the IEEE 802.16 "WiMax Mobile" standard. In 2007, Andrzej became a member of the CERN openlab as a Marie Curie Fellow sponsored by the European Commission. His current research is focused on performance tuning, parallelism and modern many-core processor architectures. Another part of Andrzej's activities is related to educational work both within and outside of CERN.
DAVE GOODELL AND JIM DINAN
A Hands-On Tutorial With MPI One-Sided Communication
This half-day tutorial covers all major aspects of MPI one-sided
communication (also known as MPI RMA). Topics will include MPI
active and passive communication models, noncontiguous
communication using MPI datatypes, MPI's shared data consistency
model, opportunities for performance tuning, and a preview of the
recently ratified MPI 3.0 extensions to one-sided communication.
Topics will be covered with a focus on application use-cases and
hands-on, actionable examples that will provide users with
take-home templates for common one-sided patterns.
David Goodell is a software developer at the Mathematics and Computer Science division of Argonne National Laboratory. He primarily works on the MPICH2 project, a widely-portable, high-quality implementation of the MPI standard. His research interests include communications software for parallel programming, system software portability, and lock-free algorithms for parallel programming.
Jim Dinan is the James Wallace Givens postdoctoral fellow at Argonne National Laboratory. He received his Ph.D. from the Department of Computer Science at The Ohio State University. His research interests include parallel programming models, parallel algorithms and applications, runtime systems, dynamic load balancing, computer architecture, and energy-aware computing.
Programming the GPU with CUDA
This tutorial gives a comprehensive introduction to programming
the GPU architecture using the Compute Unified Device Architecture
(CUDA). CUDA is an architecture and software paradigm designed
for general-purpose computing and hence does not require explicit
use of vertices, textures, colors, pixels and other elements of
traditional graphics programming. CUDA was introduced in late 2006
for programming many-core GPU architectures using extensions to
the C language, and it is available for Windows, Linux and Mac OS
users. A compiler generates executable code for the GPU, which the
CPU sees as a many-core co-processor/accelerator. Since
its inception, CUDA has achieved extraordinary speed-up factors in
a great range of grand-challenge applications and has continuously
increased its popularity within the High Performance Computing
community. As a measure of this reach, CUDA is taught at
more than 500 universities worldwide, and it shares a range of
computational interfaces with two competitors: OpenCL, championed
by the Khronos Group, and DirectCompute, led by Microsoft.
Third-party wrappers are also available for Python, Perl, Java,
Fortran, Ruby, Lua, Haskell, MATLAB and IDL.
The tutorial is organized into two parts: First, we describe the
CUDA architecture through hardware generations until we reach
Fermi models. Second, we illustrate the way of programming
applications using those resources, transforming typical
sequential CPU programs into parallel codes. We emphasize the use
of CUDA threads hierarchy structured into blocks, grids and
kernels, and CUDA memory hierarchy decomposed into caches,
texture, constant and shared memory, plus a large register file.
For a programmer, the CUDA model is a collection of threads
running in parallel, all of which can access any memory location;
as expected, however, performance improves when threads use closer
memory tiers and/or read memory collectively in groups.
Illustrative examples will
be used to discuss fundamental building blocks in CUDA,
programming tricks, memory optimizations and performance issues on
single graphics cards and even multi-GPU systems.
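The thread and memory hierarchies described above can be illustrated with a minimal CUDA C vector addition (a sketch only; the block size of 256 is a common but arbitrary choice, and error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Each CUDA thread computes one element; blockIdx/blockDim/threadIdx
   map the kernel's grid-of-blocks hierarchy onto a 1-D index space. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* Explicit device allocations and copies: the CPU sees the GPU
       as a co-processor with its own memory. */
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         /* threads per block */
    int blocks = (n + threads - 1) / threads;  /* blocks per grid   */
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Tuning such a kernel — choosing block sizes, exploiting shared memory and coalesced access — is exactly the subject of the tutorial's second part.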
Manuel Ujaldon is Associate Professor at the Computer Architecture
Department, University of Malaga (Spain) and Conjoint Senior Lecturer
at the School of Electrical Engineering and Computer Science of the
University of Newcastle (Australia).
He worked in the 90's on parallelizing compilers, finishing his PhD
thesis in 1996 by developing a data-parallel compiler for sparse matrix
and irregular applications. Over this period, he was part of the HPF
and MPI Forums, working as a post-doc in the Computer Science Department
of the University of Maryland, College Park.
He started working on the GPGPU movement early in 2003 using Cg,
and wrote the first book in Spanish about programming GPUs
for general-purpose computing, in which he described how to map irregular
applications and linear algebra algorithms onto GPUs. He adopted CUDA
when it was first released, focusing since then on image processing and
biomedical applications. Over the past five years, he has authored more than
40 papers in journals and international conferences in these two areas.
Dr. Ujaldon has been recognized with the NVIDIA Academic Partnership
2008-2011, NVIDIA Teaching Center 2011-2013 and NVIDIA Research Center
2012 awards, and was named a CUDA Fellow in 2012. He has taught more
than 30 courses on CUDA programming
worldwide, including ACM and IEEE conferences and academic programs in
European, North American and Australian Universities.
Parallel I/O in Practice
I/O on HPC systems is a black art. This tutorial sheds light on
the state-of-the-art in parallel I/O and provides the knowledge
necessary for attendees to best leverage I/O resources available
to them. We cover the I/O software stack from parallel file
systems at the lowest layer, to intermediate layers (such as
MPI-IO), and finally high-level I/O libraries (such as HDF5). We
emphasize ways to use these interfaces that result in high
performance, and benchmarks on real systems are used throughout to
show real-world results.
This tutorial first discusses parallel file systems (PFSs) in
detail. We cover general concepts and examine four examples: GPFS,
Lustre, PanFS, and PVFS. We examine the upper layers of the I/O
stack, covering POSIX I/O, MPI-IO, Parallel netCDF, and HDF5. We
discuss interface features, show code examples, and describe how
application calls translate into PFS operations. Finally we
discuss I/O best practice.
Dries Kimpe is Assistant Computer Scientist at Argonne National Laboratory. He received his master's degree from Ghent University (Belgium) and, in 2008, his PhD from KU Leuven (Belgium). Dries Kimpe is also a fellow of the Computation Institute at the University of Chicago. His research interests include parallel file systems, programming models for high performance computing and numerical simulation. His current research focuses on the integration of storage into high performance computing systems (in-system storage), and the development of petascale storage systems.