Cover Page
Title Page
Copyright and Credits
Learn CUDA Programming
Dedication
About Packt
Why subscribe?
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to CUDA Programming
The history of high-performance computing
Heterogeneous computing
Programming paradigm
Low latency versus higher throughput
Programming approaches to GPU
Technical requirements
Hello World from CUDA
Thread hierarchy
GPU architecture
Vector addition using CUDA
Experiment 1 – creating multiple blocks
Experiment 2 – creating multiple threads
Experiment 3 – combining blocks and threads
Why bother with threads and blocks?
Launching kernels in multiple dimensions
Error reporting in CUDA
Data type support in CUDA
Summary
CUDA Memory Management
NVIDIA Visual Profiler
Global memory/device memory
Vector addition on global memory
Coalesced versus uncoalesced global memory access
Memory throughput analysis
Shared memory
Matrix transpose on shared memory
Bank conflicts and their effect on shared memory
Read-only data/cache
Computer vision – image scaling using texture memory
Registers in GPU
Pinned memory
Bandwidth test – pinned versus pageable
Unified memory
Understanding unified memory page allocation and transfer
Optimizing unified memory with warp per page
Optimizing unified memory using data prefetching
GPU memory evolution
Why do GPUs have caches?
CUDA Thread Programming
CUDA threads, blocks, and the GPU
Exploiting a CUDA block and warp
Understanding CUDA occupancy
Setting NVCC to report GPU resource usages
Settings for Linux
Settings for Windows
Analyzing the optimal occupancy using the Occupancy Calculator
Occupancy tuning – bounding register usage
Getting the achieved occupancy from the profiler
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Writing performance measurement code
Performance comparison for the two reductions – global and shared memory
Identifying the application's performance limiter
Finding the performance limiter and optimization
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Maximizing memory bandwidth with grid-strided loops
Balancing the I/O throughput
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Explicit grouped threads' operation and race condition avoidance
Dynamic active thread selection
Applying to the parallel reduction
Cooperative Groups to avoid deadlock
Loop unrolling in the CUDA kernel
Atomic operations
Low/mixed precision operations