Introduction and motivation

Python is widely used because it is productive, expressive, and surrounded by excellent scientific libraries. The same qualities can hide performance bottlenecks: a loop that is acceptable for a small dataset may become impractical when the data grows, and adding more cores or accelerators does not automatically make a program faster.

Parallel computing is not one technique but a set of tools, each suited to a different bottleneck. Threads can help when work waits on input/output or when optimized native libraries release Python’s Global Interpreter Lock (GIL). Processes can use multiple CPU cores for compute-bound Python work, but they add startup overhead and the cost of copying data between processes. MPI supports explicit communication across processes and machines. Snakemake describes workflows as rules with declared inputs and outputs, so steps that do not depend on each other can run concurrently. ipyparallel connects interactive Jupyter workflows with parallel engines, including MPI-backed execution. GPUs provide massive throughput for suitable arithmetic workloads, but data transfer and memory layout matter. Dask scales familiar Python data workflows through lazy task graphs, chunked collections, and distributed schedulers.
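As a small illustration of the first case above (threads helping when work waits on input/output), the sketch below compares a serial loop with a thread pool from the standard library. It is not taken from the course material: `fake_fetch`, `run_serial`, `run_threaded`, and `timed` are hypothetical names, and `time.sleep` stands in for a real network or disk wait. Actual speedups depend on the workload and the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(item, delay=0.05):
    """Hypothetical stand-in for an I/O-bound call; sleeping releases the GIL."""
    time.sleep(delay)
    return item * 2


def run_serial(items):
    # One call after another: total wait is roughly len(items) * delay.
    return [fake_fetch(x) for x in items]


def run_threaded(items, max_workers=8):
    # Threads overlap their waiting time; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, items))


def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    items = list(range(8))
    serial_result, serial_s = timed(run_serial, items)
    threaded_result, threaded_s = timed(run_threaded, items)

    # Check correctness before comparing speed.
    assert serial_result == threaded_result
    print(f"serial:   {serial_s:.3f} s")
    print(f"threaded: {threaded_s:.3f} s")
```

If `fake_fetch` did pure-Python arithmetic instead of sleeping, the threaded version would typically show little or no gain, which is the situation where processes become the more likely tool.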

Throughout the course, the goal is to reason about performance rather than simply add parallelism. Learners will practice identifying independent work, avoiding race conditions, measuring overhead, choosing suitable chunk sizes, understanding communication and transfer costs, and validating that optimized code still produces correct results.