Quiz solutions

Multiple choice, single answer

  1. What is the main effect of Python’s Global Interpreter Lock (GIL) on CPU-bound Python threads?

  • A) It prevents multiple Python bytecode-executing threads from running at the same time in one process

  2. Which workload is most likely to benefit from Python threads?

  • B) A task that spends much of its time waiting for file, network, or other I/O

  3. Why can multiprocessing speed up CPU-bound Python work?

  • B) Processes can run in separate interpreters with separate GILs

  4. What is a race condition?

  • B) Two or more concurrent tasks access shared state and the result depends on timing

  5. In MPI, what does a rank identify?

  • C) One process within an MPI communicator

  6. Which MPI operation is most appropriate when rank 0 needs to collect partial results from all ranks?

  • B) gather

  7. What makes a workflow embarrassingly parallel?

  • B) Tasks can run independently with little or no communication

  8. What does Snakemake use to decide which workflow steps can run in parallel?

  • C) Declared input and output dependencies between rules

  9. Which computation is usually a good candidate for GPU acceleration?

  • A) Many independent arithmetic operations over large arrays

  10. In a CUDA kernel, what are blocks and threads used for?

  • A) Organizing parallel work on the GPU

  11. Why can copying data between CPU memory and GPU memory reduce speedup?

  • B) Transfer time can dominate if the computation is too small

  12. What does lazy evaluation mean in Dask?

  • B) Dask builds a task graph and delays execution until a result is requested

  13. Why does Dask chunk size matter?

  • A) It controls the balance between memory use, parallelism, and scheduling overhead

  14. Which Dask scheduler is designed for scaling work across multiple worker processes and potentially multiple machines?

  • C) distributed

Short conceptual questions

  1. Explain why CPU-bound pure Python code may not speed up when using threads.

    CPU-bound pure Python code spends most of its time executing Python bytecode. In a standard CPython process, the GIL allows only one thread at a time to execute Python bytecode, so adding threads does not spread this kind of work across multiple CPU cores.
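
    The effect can be observed with a small timing sketch. The function count_down and the helper names below are illustrative assumptions, not part of the quiz. On a standard CPython build the threaded run typically takes about as long as the serial one, though exact timings depend on the machine and the interpreter build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound pure Python loop (hypothetical workload); returns the step count."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


def run_serial(n, repeats):
    """Run the workload `repeats` times, one after another."""
    return [count_down(n) for _ in range(repeats)]


def run_threaded(n, repeats):
    """Run the same workload in `repeats` threads inside one process."""
    with ThreadPoolExecutor(max_workers=repeats) as executor:
        return list(executor.map(count_down, [n] * repeats))


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_seconds = timed(run_serial, 500_000, 4)
    _, threaded_seconds = timed(run_threaded, 500_000, 4)
    print(f"serial:   {serial_seconds:.2f} s")
    print(f"threaded: {threaded_seconds:.2f} s")
```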

  2. For each task below, choose a suitable parallel strategy and briefly justify it:

  • Downloading many files: use threads or asynchronous I/O because the work is mostly waiting on I/O.

  • Computing independent numerical integrals in pure Python: use multiprocessing because each process has its own interpreter and can run on a separate CPU core.

  • Running the same analysis on many input files with clear input/output dependencies: use Snakemake because the workflow can be expressed as rules with dependencies.

  • Applying the same arithmetic operation to a very large array on a suitable accelerator: use GPU computing with Numba/CUDA or a GPU array library because the work has many independent arithmetic operations.

  • Processing an array too large to fit comfortably in memory: use Dask arrays with appropriate chunking so the computation is built lazily and evaluated chunk by chunk instead of loading the whole array at once.
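
    For the first case in the list, a minimal sketch of overlapping I/O waits with threads. The function fake_download is a hypothetical stand-in that uses time.sleep in place of a blocking network call; a real downloader would replace it.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.05):
    """Hypothetical download: sleeping releases the GIL, as blocking I/O does."""
    time.sleep(delay)
    return f"{name}: done"


def download_all(names, max_workers=8):
    """Run the downloads in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_download, names))


if __name__ == "__main__":
    names = [f"file{i}.dat" for i in range(8)]
    start = time.perf_counter()
    results = download_all(names)
    elapsed = time.perf_counter() - start
    print(results)
    print(f"elapsed: {elapsed:.2f} s for {len(names)} simulated downloads")
```

    Because the threads spend their time waiting rather than executing bytecode, the waits overlap and the total time is usually much closer to one delay than to the sum of all delays.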

  3. Explain why parallel code can be slower than serial code for small problems.

    Parallel execution has overhead: creating threads or processes, scheduling tasks, copying or communicating data, transferring data to a GPU, and combining results. If the computation is small, these costs can exceed the time saved by parallel execution.
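
    A rough cost model makes the trade-off concrete. The functions below are a simplifying assumption (perfectly divisible work plus one fixed overhead term), not a measurement, and the numbers in the example are hypothetical.

```python
def estimated_parallel_time(serial_seconds, workers, overhead_seconds):
    """Idealized model: fixed overhead plus perfectly divided work."""
    return overhead_seconds + serial_seconds / workers


def speedup(serial_seconds, workers, overhead_seconds):
    """Serial time divided by the modeled parallel time."""
    parallel_seconds = estimated_parallel_time(serial_seconds, workers, overhead_seconds)
    return serial_seconds / parallel_seconds


if __name__ == "__main__":
    # Hypothetical setup: 4 workers and 0.05 s of startup and communication overhead.
    for serial_seconds in (0.01, 1.0, 100.0):
        print(f"{serial_seconds:>6} s serial -> speedup {speedup(serial_seconds, 4, 0.05):.2f}")
```

    In this model the 0.01 s problem gets a speedup below 1 (the parallel version is slower), while the 100 s problem approaches the ideal factor of 4.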

  4. Describe two ways to avoid or fix race conditions.

    One approach is to protect shared mutable state with synchronization such as a lock. Another is to avoid shared mutable state by giving each worker its own data and combining independent results after the parallel section.

  5. Compare MPI, Snakemake, and Dask at a high level. What kind of problem is each one well suited for?

    MPI is suited for explicit communication between distributed processes, especially when a program needs control over ranks and message passing. Snakemake is suited for file-based workflows where rules declare input and output dependencies. Dask is suited for task-graph-based parallel analytics, especially with chunked arrays, dataframes, bags, delayed functions, or distributed workers.

Coding and performance-analysis questions

  1. A serial program computes a function work(x) independently for every value in a list called items. Sketch how to parallelize it with multiprocessing.Pool.

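from multiprocessing import Pool

# work and items are assumed to be defined at module level.
if __name__ == "__main__":  # guard needed when worker processes are spawned
    with Pool(processes=4) as pool:
        results = pool.map(work, items)  # results keep the order of items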

If work needs several arguments, use Pool.starmap or create a partially applied function with functools.partial.
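
Both options can be sketched as follows. The three-argument work function and the run_with_* helpers are hypothetical names chosen for illustration; only Pool.starmap, Pool.map, and functools.partial come from the standard library.

```python
from functools import partial
from multiprocessing import Pool


def work(x, scale, offset):
    """Hypothetical task that needs more than one argument."""
    return x * scale + offset


def run_with_starmap(items, scale, offset, processes=4):
    """Pass one argument tuple per task; starmap unpacks each tuple."""
    args = [(x, scale, offset) for x in items]
    with Pool(processes=processes) as pool:
        return pool.starmap(work, args)


def run_with_partial(items, scale, offset, processes=4):
    """Fix the extra arguments first, then use plain map over the items."""
    task = partial(work, scale=scale, offset=offset)
    with Pool(processes=processes) as pool:
        return pool.map(task, items)
```

Either helper should be called from under an if __name__ == "__main__": guard, for example run_with_starmap([1, 2, 3], scale=2, offset=1), so that worker processes can import the module safely.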

  2. The following threaded code updates shared state. Identify the problem and describe one fix.

counter = 0

def update():
    global counter
    for _ in range(1000):
        counter += 1

The problem is a race condition: counter += 1 is not atomic. It reads, modifies, and writes shared state as separate steps, so concurrent threads can interleave and overwrite each other’s updates. One fix is to protect the update with a lock:

from threading import Lock

counter = 0
lock = Lock()

def update():
    global counter
    for _ in range(1000):
        with lock:
            counter += 1

Another fix is to let each worker compute a private count and sum the private counts after all workers finish.
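
A minimal sketch of that second approach, assuming hypothetical helper names count_privately and run_workers. Each thread increments only a local variable and writes once to its own slot, so no lock is needed.

```python
from threading import Thread


def count_privately(n_updates, results, index):
    """Count in a local variable, then store the total in this worker's own slot."""
    local_count = 0
    for _ in range(n_updates):
        local_count += 1
    results[index] = local_count  # no other thread writes to this index


def run_workers(n_workers=4, n_updates=1000):
    """Start the workers, wait for all of them, then combine the private counts."""
    results = [0] * n_workers
    threads = [
        Thread(target=count_privately, args=(n_updates, results, i))
        for i in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)


if __name__ == "__main__":
    print(run_workers())
```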

  3. Sketch the MPI communication pattern for a program where each rank computes a partial NumPy array and rank 0 combines all partial arrays into one total result.

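from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# compute_partial_array() stands in for this rank's share of the work.
partial = compute_partial_array()
# Collective call: every rank takes part; only rank 0 receives the list.
partials = comm.gather(partial, root=0)

if rank == 0:
    # Element-wise sum of the partial arrays from all ranks.
    total = np.zeros_like(partials[0])
    for value in partials:
        total += value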

This uses collective communication. A point-to-point version could have non-root ranks send their arrays to rank 0, while rank 0 receives and accumulates them.

  4. A Dask array computation with very small chunks creates millions of tasks and runs slowly. Explain what you would change and how you would use the dashboard or timing measurements to evaluate the change.

    Increase the chunk size so each task does more useful work and the scheduler handles fewer tasks. The chunks should still fit comfortably in worker memory, with enough chunks to keep workers busy. Compare wall-clock time before and after rechunking, and use the Dask dashboard to inspect task count, worker utilization, memory use, spilling, and scheduling overhead.
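
    The effect of chunk size on the number of chunks can be estimated before running anything. The sketch below uses plain Python arithmetic rather than Dask itself; the helper name chunk_summary, the array shape, and the chunk shapes are made-up examples, and the count covers only the chunks of one array.

```python
import math


def chunk_summary(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk) for a regular chunking.

    itemsize=8 assumes float64 elements.
    """
    n_chunks = math.prod(math.ceil(dim / chunk) for dim, chunk in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


if __name__ == "__main__":
    shape = (100_000, 100_000)  # hypothetical array
    for chunks in [(100, 100), (4_000, 4_000)]:
        n_chunks, chunk_bytes = chunk_summary(shape, chunks)
        print(f"chunks={chunks}: {n_chunks:,} chunks of {chunk_bytes / 1e6:.2f} MB")
```

    For this hypothetical array, 100 × 100 chunks give 1,000,000 chunks of 80 kB each, while 4,000 × 4,000 chunks give 625 chunks of 128 MB each, which is far less work for the scheduler as long as a few such chunks fit in worker memory.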