Cover Page
Title Page
Copyright and Credits
Learn CUDA Programming
Dedication
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to CUDA Programming
The history of high-performance computing
Heterogeneous computing
Programming paradigm
Low latency versus higher throughput
Programming approaches to GPU
Technical requirements
Hello World from CUDA
Thread hierarchy
GPU architecture
Vector addition using CUDA
Experiment 1 – creating multiple blocks
Experiment 2 – creating multiple threads
Experiment 3 – combining blocks and threads
Why bother with threads and blocks?
Launching kernels in multiple dimensions
Error reporting in CUDA
Data type support in CUDA
Summary
CUDA Memory Management
NVIDIA Visual Profiler
Global memory/device memory
Vector addition on global memory
Coalesced versus uncoalesced global memory access
Memory throughput analysis
Shared memory
Matrix transpose on shared memory
Bank conflicts and their effect on shared memory
Read-only data/cache
Computer vision – image scaling using texture memory
Registers in GPU
Pinned memory
Bandwidth test – pinned versus pageable
Unified memory
Understanding unified memory page allocation and transfer
Optimizing unified memory with warp per page
Optimizing unified memory using data prefetching
GPU memory evolution
Why do GPUs have caches?
CUDA Thread Programming
CUDA threads, blocks, and the GPU
Exploiting a CUDA block and warp
Understanding CUDA occupancy
Setting NVCC to report GPU resource usages
Settings for Linux
Settings for Windows
Analyzing the optimal occupancy using the Occupancy Calculator
Occupancy tuning – bounding register usage
Getting the achieved occupancy from the profiler
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Writing performance measurement code
Performance comparison for the two reductions – global and shared memory
Identifying the application's performance limiter
Finding the performance limiter and optimization
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Maximizing memory bandwidth with grid-strided loops
Balancing the I/O throughput
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Explicit grouped threads' operation and race condition avoidance
Dynamic active thread selection
Applying to the parallel reduction
Cooperative Groups to avoid deadlock
Loop unrolling in the CUDA kernel
Atomic operations
Low/mixed precision operations