
High Performance Parallelism Pearls Volume One : Multicore and Many-Core Programming Approaches.

Material type: Text
Publisher: San Diego : Elsevier Science & Technology, 2014
Copyright date: ©2015
Edition: 1st ed.
Description: 1 online resource (549 pages)
Content type:
  • text
Media type:
  • computer
Carrier type:
  • online resource
ISBN:
  • 9780128021996
Additional physical formats: Print version: High Performance Parallelism Pearls Volume One
DDC classification:
  • 004.35
LOC classification:
  • QA76.642 .R456 2015eb
Contents:
Front Cover -- High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches -- Copyright -- Contents -- Contributors -- Acknowledgments -- Foreword -- Humongous computing needs: Science years in the making -- Open standards -- Keen on many-core architecture -- Xeon Phi is born: Many cores, excellent vector ISA -- Learn highly scalable parallel programming -- Future demands grow: Programming models matter -- Preface -- Inspired by 61 cores: A new era in programming -- Chapter 1: Introduction -- Learning from successful experiences -- Code modernization -- Modernize with concurrent algorithms -- Modernize with vectorization and data locality -- Understanding power usage -- ISPC and OpenCL anyone? -- Intel Xeon Phi coprocessor specific -- Many-core, neo-heterogeneous -- No "Xeon Phi" in the title, neo-heterogeneous programming -- The future of many-core -- Downloads -- Chapter 2: From "Correct" to "Correct & Efficient": A Hydro2D Case Study with Godunov's Scheme -- Scientific computing on contemporary computers -- Modern computing environments -- CEA's Hydro2D -- A numerical method for shock hydrodynamics -- Euler's equation -- Godunov's method -- Where it fits -- Features of modern architectures -- Performance-oriented architecture -- Programming tools and runtimes -- Our computing environments -- Paths to performance -- Running Hydro2D -- Hydro2D's structure -- Computation scheme -- Data structures -- Measuring performance -- Optimizations -- Memory usage -- Thread-level parallelism -- Arithmetic efficiency and instruction-level parallelism -- Data-level parallelism -- Summary -- The coprocessor vs the processor -- A rising tide lifts all boats -- Performance strategies -- Chapter 3: Better Concurrency and SIMD on HBM -- The application: HIROMB-BOOS-Model -- Key usage: DMI -- HBM execution profile.
Overview for the optimization of HBM -- Data structures: Locality done right -- Thread parallelism in HBM -- Data parallelism: SIMD vectorization -- Trivial obstacles -- Premature abstraction is the root of all evil -- Results -- Profiling details -- Scaling on processor vs. coprocessor -- Contiguous attribute -- Summary -- References -- Chapter 4: Optimizing for Reacting Navier-Stokes Equations -- Getting started -- Version 1.0: Baseline -- Version 2.0: ThreadBox -- Version 3.0: Stack memory -- Version 4.0: Blocking -- Version 5.0: Vectorization -- Intel Xeon Phi coprocessor results -- Summary -- Chapter 5: Plesiochronous Phasing Barriers -- What can be done to improve the code? -- What more can be done to improve the code? -- Hyper-Thread Phalanx -- What is nonoptimal about this strategy? -- Coding the Hyper-Thread Phalanx -- How to determine thread binding to core and HT within core? -- The Hyper-Thread Phalanx hand-partitioning technique -- A lesson learned -- Back to work -- Data alignment -- Use aligned data when possible -- Redundancy can be good for you -- The plesiochronous phasing barrier -- Let us do something to recover this wasted time -- A few "left to the reader" possibilities -- Xeon host performance improvements similar to Xeon Phi -- Summary -- Chapter 6: Parallel Evaluation of Fault Tree Expressions -- Motivation and background -- Expressions -- Expression of choice: Fault trees -- An application for fault trees: Ballistic simulation -- Example implementation -- Syntax and parsing results -- Creating evaluation arrays -- Evaluating the expression array -- Using ispc for vectorization -- Other considerations -- Summary -- Chapter 7: Deep-Learning Numerical Optimization -- Fitting an objective function -- Objective functions and principal components analysis -- Software and example data -- Training data -- Runtime results.
Scaling results -- Summary -- Chapter 8: Optimizing Gather/Scatter Patterns -- Gather/scatter instructions in Intel® architecture -- Gather/scatter patterns in molecular dynamics -- Optimizing gather/scatter patterns -- Improving temporal and spatial locality -- Choosing an appropriate data layout: AoS versus SoA -- On-the-fly transposition between AoS and SoA -- Amortizing gather/scatter and transposition costs -- Summary -- Chapter 9: A Many-Core Implementation of the Direct N-Body Problem -- N-Body simulations -- Initial solution -- Theoretical limit -- Reduce the overheads, align your data -- Optimize the memory hierarchy -- Improving our tiling -- What does all this mean to the host version? -- Summary -- Chapter 10: N-Body Methods -- Fast N-body methods and direct N-body kernels -- Applications of N-body methods -- Direct N-body code -- Performance results -- Summary -- Chapter 11: Dynamic Load Balancing Using OpenMP 4.0 -- Maximizing hardware usage -- The N-Body kernel -- The offloaded version -- A first processor combined with coprocessor version -- Version for processor with multiple coprocessors -- Chapter 12: Concurrent Kernel Offloading -- Setting the context -- Motivating example: particle dynamics -- Organization of this chapter -- Concurrent kernels on the coprocessor -- Coprocessor device partitioning and thread affinity -- Offloading from OpenMP host program -- Offloading from MPI host program -- Case study: concurrent Intel MKL dgemm offloading -- Persistent thread groups and affinities on the coprocessor -- Concurrent data transfers -- Case study: concurrent MKL dgemm offloading with data transfers -- Force computation in PD using concurrent kernel offloading -- Parallel force evaluation using Newton's 3rd law -- Implementation of the concurrent force computation -- Performance evaluation: before and after -- The bottom line.
Chapter 13: Heterogeneous Computing with MPI -- MPI in the modern clusters -- MPI task location -- Single-task hybrid programs -- Selection of the DAPL providers -- The first provider ofa-v2-mlx4_0-1u -- The second provider ofa-v2-scif0 and the impact of the intra-node fabric -- The last provider, also called the proxy -- Hybrid application scalability -- Load balance -- Task and thread mapping -- Summary -- Acknowledgments -- Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor -- Power analysis 101 -- Measuring power and temperature with software -- Creating a power and temperature monitor script -- Creating a power and temperature logger with the micsmc tool -- Power analysis using IPMI -- Hardware-based power analysis methods -- A hardware-based coprocessor power analyzer -- Summary -- Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment -- Early explorations -- Beacon system history -- Beacon system architecture -- Hardware -- Software environment -- Intel MPSS installation procedure -- Preparing the system -- Installation of the Intel MPSS stack -- Generating and customizing configuration files -- MPSS upgrade procedure -- Setting up the resource and workload managers -- TORQUE -- Prologue -- Epilogue -- TORQUE/coprocessor integration -- Moab -- Improving network locality -- Moab/coprocessor integration -- Health checking and monitoring -- Scripting common commands -- User software environment -- Future directions -- Summary -- Acknowledgments -- Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors -- Network configuration concepts and goals -- A look at networking options -- Steps to set up a cluster enabled coprocessor -- Coprocessor file systems support -- Support for NFS -- Support for Lustre® file system -- Support for Fraunhofer BeeGFS® (formerly FHGFS) file system.
Support for Panasas® PanFS® file system -- Choosing a cluster file system -- Summary -- Chapter 17: NWChem: Quantum Chemistry Simulations at Scale -- Introduction -- Overview of single-reference CC formalism -- NWChem software architecture -- Global Arrays -- Tensor Contraction Engine -- Engineering an offload solution -- Offload architecture -- Kernel optimizations -- Performance evaluation -- Summary -- Acknowledgments -- Chapter 18: Efficient Nested Parallelism on Large-Scale Systems -- Motivation -- The benchmark -- Baseline benchmarking -- Pipeline approach-flat_arena class -- Intel® TBB user-managed task arenas -- Hierarchical approach-hierarchical_arena class -- Performance evaluation -- Implication on NUMA architectures -- Summary -- Chapter 19: Performance Optimization of Black-Scholes Pricing -- Financial market model basics and the Black-Scholes formula -- Financial market mathematical model -- European option and fair price concepts -- Black-Scholes formula -- Options pricing -- Test infrastructure -- Case study -- Preliminary version-Checking correctness -- Reference version-Choose appropriate data structures -- Reference version-Do not mix data types -- Vectorize loops -- Use fast math functions: erff() vs. cdfnormf() -- Equivalent transformations of code -- Align arrays -- Reduce precision if possible -- Work in parallel -- Use warm-up -- Using the Intel Xeon Phi coprocessor-"No effort" port -- Use Intel Xeon Phi coprocessor: Work in parallel -- Use Intel Xeon Phi coprocessor and streaming stores -- Summary -- Chapter 20: Data Transfer Using the Intel COI Library -- First steps with the Intel COI library -- COI buffer types and transfer performance -- Applications -- Summary -- Chapter 21: High-Performance Ray Tracing -- Background -- Vectorizing ray traversal -- The Embree ray tracing kernels -- Using Embree in an application.
Performance.


Description based on publisher-supplied metadata and other sources.

Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2024. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
