Continuous Profiling


Profiling systems typically fall into one of the two categories. The first group uses binary modification, compiler support, or direct simulation of programs to gather measurements. They introduce significant overhead and usually require significant user intervention, so they are not deployed on production systems. The other group uses statistical sampling to collect fine-grained information on program or system behavior. They incur much smaller overhead but relies on existing source of interrupts (e.g. timer interrupts) to generated samples. This prevents them from sampling within those interrupt routines and can result in correlation between sampling and other system activity (bias).

The profiling system presented in this paper falls into the second category. It consists of two parts: a data collection subsystem that runs during program execution, and a suite of analysis tools that analyzes the collected data.

The data collection subsystem consists of 3 interacting components: a kernel device driver that services performance counter interrupts; a user-mode daemon process that extracts samples from the driver; and a modified system loader and other mechanisms for identifying executable images and where they are loaded by each running process.

First, each processor generates a high-priority interrupt after a specified number of events have occurred. The kernel device driver handles this interrupt by recording the PID of the running process, the program counter (PC), and the event type causing the interrupt. Since interrupts are frequently issued, the kernel device driver needs to handle them quickly. The driver keeps a per-process hash table to keep track of the number of times each (PID, PC, EVENT) tuple has occurred. This data structure avoids copying the recorded data to userspace on every timer interrupt. The hash table is implemented as an array of fixed size buckets. Whenever a new (PID, PC, EVENT) tuple is to be added to a full bucket, an old tuple in that bucket is evicted and copied to the user-mode daemon through a pair of overflow buffers. This basic design is efficient since it is cache friendly and synchronization is only needed when copying data into buffers. This paper handles this synchronization through inter-process interrupts. The user-mode daemon will map the events to the corresponding location in the executable image (through PID and PC) and store the records to disk.

After the samples are collected through the data collection subsystem, the analysis tool can be used offline to analyze the profiling results. It will identify for each instruction a frequency (number of times it is executed for a fixed time period), a cpi (average number of cycles spent in one execution), and a set of culprits (possible explanations for any wasted issue slots).

Estimating frequency and cpi is tricky since the data collection subsystem only shows the amount of time spent in this instruction relative to overall execution time, indicated by the sample count. More specifically, for each instruction $i$, its frequency $F_i$, cpi $C_i$, and sample frequency $S_i$ should satisfy $F_i\cdot C_i=S_i$. $S_i$ can be calculated directly from the sampling result. The analysis tool estimates $F_i$ through the following: it first does a static analysis on the program to identify the instructions that are guaranteed to execute for the same amount of time (i.e. basic blocks). This splits the program into a control-flow graph of basic blocks. It then runs each basic block $I$ individually to obtain an approximate $\sum_{i\in I} C_i$. Then the estimated frequency is $F_j=\sum_{i\in I} S_i/\sum_{i\in I} C_i$ for any instruction $j\in I$. Then it can calculate each $C_j$ accordingly.

To find the list of all possible culprits, this paper separately discusses static and dynamic causes. For static causes, it schedules instructions in each basic block using an accurate model of the processor issue logic and assume no dynamic stalls. With detailed record-keeping and accurate model, it is able to identify the static causes. For dynamic causes, it uses different techniques for each cause to check if it is possible that the dynamic stall is caused by this.

Anderson, Jennifer M., et al. “Continuous profiling: Where have all the cycles gone?.” ACM Transactions on Computer Systems (TOCS) 15.4 (1997): 357-390.



Existing DNN frameworks manages the DNN operators in a data flow graph. The library (e.g. PyTorch) schedules each operator individually and relies on the hardware scheduling (e.g. cuDNN) to exploit parallelism within operators. This two-layer scheduling scheme works well only when the kernel launching time is largely negligible compared to execution time and when there is sufficient intra-operator parallelism to saturate all processing units, but precludes opportunities to run multiple operators in parallel on the same GPU.

(a) shows the two-layer scheduling approach; (b) is a more efficient scheduling plan. Notice that this more aggressive plan requires that Operator 0 and 1 do not depend on each other.

Rammer is a deep learning compiler aimed to unify the inter- and intra-operator scheduling. It defines each DNN operator as rOperator and splits the rOperators into rTasks. A rTask is the smallest unit of scheduling and will run on a single processing unit (e.g., SM in GPU). We can think of rTasks as thread blocks. Rammer also introduces a special rTask, namely barrier rTask, that stalls execution until a set of rTasks has completed. Another abstraction that Rammer provides is rKernels, which corresponds to the actual implementation of rOperators (e.g., if the rOperator is convolution, then rKernel can be matrix multiplication, FFT, etc). Notice that different rKernels will split the rOperator into different rTasks.

Rammer abstracts the hardware as a virtualized parallel device (vDevice, corresponding to GPUs) composed of multiple virtualized execution units (vEU, corresponding to the SMs). This paper achieves single-layer scheduling by assigning rTasks to different vEUs at compile time, and then pin the compiled vEUs on the hardware processing units. From the DNN model, Rammer generates a static execution plan. This plan is broken into multiple parts called rProgram, which is represented as a 2D array of rTasks where the first index represents on which vEU this rTask is assigned to and the second index represents the order in which it will be run on that vEU. Each rProgram runs on a single rDevice. Rammer thereby achieves scheduling over multiple hardware devices (GPUs).

Rammer architecture. Accelerator refers to the hardware processing units.

The architecture of Rammer is shown in the figure above. After obtaining the DNN model, Rammer first transforms it to a DFG of rOperators. It does some compile-time profiling to figure out which rKernel is the most efficient through profiling and heuristics. Then the rOperator can be split into rTasks. Rammer uses a Wavefront Scheduling Policy, which is essentially a BFS on the DFG of rOperators. Here wavefront refers to the rTasks that do not depend on any other unscheduled rTasks. The policy iterates through the rTasks in the wavefront and assigns the current rTask to the vEU that becomes available first (based on profiling results). However, if the profiling result shows that assigning the current rTask to the current rProgram does not save execution time, it will put the rTask to a new rProgram that will be run on a different vDevice instead.


  • Rammer exploits the inter- and intra-operator parallelism holistically. It can provide higher GPU utilization compared with traditional two-level scheduling
  • The scheduling plan is statically generated, so it does not impose any runtime overhead.


  • Rammer is only beneficial when there is not sufficient intra-operator parallelism (e.g. in inference workloads) or when the kernel launching overhead is largely negligible. Yet often times neither is true in typical training workloads.
  • Rammer can only parallelize two operators if they are independent. With linear models (e.g, ResNet) there is not much Rammer can do.
  • Rammer generates scheduling plan statically. If the underlying hardware changes dynamically (e.g., shared between multiple models in data centers), it cannot adapt to the changes.

Ma, Lingxiao, et al. “Rammer: Enabling Holistic Deep Learning Compiler Optimizations with {rTasks}.” 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.

Automatic Differentiation


Consider a function $F:\mathbb{R}^n\rightarrow\mathbb{R}^m$ defined as a composition of multiple functions: $F:C \circ B \circ A$ (i.e., $F(x)=C(B(A(x)))$). By chain rule:

$$\nabla F(x)=\nabla C(y) \cdot \nabla B(z) \cdot \nabla A(x)$$

where $z=A(x)$ and $y=B(z)$. Recall that $\nabla F(x)$ is the Jacobian matrix of $F$ (and hence is an $m\times n$ matrix). Notice that each component of the right hand size equation can be computed independently if we know $x$, $y$, and $z$. This gives us two ways to compute $\nabla F(x)$: through forward accumulation, where

$$\nabla F(x)=\nabla C(y) \cdot (\nabla B(z) \cdot \nabla A(x))$$

or through reverse accumulation, where

$$\nabla F(x)=(\nabla C(y) \cdot \nabla B(z)) \cdot \nabla A(x)$$

In machine learning workloads, the last function is usually a loss function that produces a single scalar, so $m = 1$ and the reverse accumulation can be seen as a sequence of vector-Matrix multiplications instead of matrix-matrix multiplications. Hence the reverse accumulation often utilizes Vector-Jacobian Product (VJP) operators.

Autograd allows developers to only write the forward model using familiar Python APIs (NumPy, TensorFlow, PyTorch, etc). It will automatically perform differentiations. It has two main components: 1. it defines the VJP for primitives like np.sum, math.sin, etc; 2. it traces data flow at runtime and generates a trace graph. This trace graph stores the input ($x$), the intermediate values ($z$ and $y$), and the dependency relationship of compute primitives involved ($A$, $B$, and $C$). Since the graph is generated at runtime, autograd is unaware of the control flow (e.g., if-else, while, etc) of the program: it will only keep a linear flow.

At this point we can discuss some trade-offs between forward accumulation and backward accumulation. The forward accumulation has constant memory overhead as it does not need the runtime trace and can compute the gradient directly during the forward pass (you can calculate $\nabla B(z)$ when you calculate $B(z)$ since it does not depend on anything generated in later operators). But consider a DNN with multiple layers whose weights $w_1, w_2, …$ that needs to be trained, at any point you need to keep all $\frac{\partial G}{\partial w_1}, \frac{\partial G}{\partial w_2}, …$ where $G$ is the composition function from beginning to the current intermediate point. This requires huge memory overhead: for a regular CNN, the output of a layer can be $32\times32\times3$ and 30 layers, each with $3\times3\times3\times64$ kernels. This is almost 1.6G floating point numbers in total. With more complicated DNN with larger input size and more kernels, the number of gradients cannot be all kept in memory and we will have to calculate only a portion of them in one pass. Hence forward accumulation typically requires many forward passes under machine learning network.

In contrast, reverse accumulation can be done with only one forward and one backward pass, significantly reduce the amount of computation required: with reverse accumulation, during the backward pass, we only need to keep $\frac{\partial F}{\partial w}$ and $\frac{\partial F}{\partial z}$ where $w$ includes the weight of the current layer and $z$ includes all intermediate values passed into the current layer as input. Recall that $F$ outputs a single scalar, the memory required is very limited. This, however, depends on the runtime trace and hence as a memory overhead proportional to the depth of the machine learning model. Autograd can reduce this through checkpointing: you can define your own primitives to be a series of arithmetic operations (in contrast to a single arithmetic operation in pre-defined primitives). During the reverse accumulation, it simply redoes a forward pass within that primitive to figure out the trace graph within this primitive.

JAX is a domain-specific tracing JIT compiler for generating high-performance accelerator code from pure Python and Numpy machine learning programs. It first takes the original code and generates an intermediate representation, JAX IR, which is used as a static trace for automatic differentiation and optimizations. The backend compiler XLA (Accelerated Linear Algebra) will also take JAX IR to produce efficient machine codes. Compared with Autograd that generates tracing graphs dynamically during runtime, JAX is aware of control flows and can generate more compact tracing graphs. However, if the graph changes slightly (say, the input has a different size), JAX will have to regenerate the trace.

Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machinelearning programs via high-level tracing.Systems for Machine Learning(2018),23–24.



cuDNN is a library of efficient implementations of deep learning primitives. The main goal of cuDNN is to simplify maintenance of workloads in the fast development of hardware like GPUs and TPUs: the workloads can use cuDNN as arithmetic primitives without concerning about low level implementations. Hence this library needs to: 1. be easy to integrate into existing frameworks; 2. be optimized for performance and memory usage.

cuDNN supports forward and backward propagation variants of all its routines in single and double precision floating-point arithmetic. This include convolution, pooling, and activation functions. Pooling and activation functions are relatively easy to implement so we focus on convolution here. There are many ways to implement convolution efficiently. The goal is to achieve performance as close to matrix multiplication (a well-studied and highly-optimized workload) as possible without using any auxiliary memory from the host machine. GPU memory has high bandwidth but relatively small capacity compared with auxiliary memory. Convolutions are traditionally implemented in one of the 3 ways:

1. Unroll the convolution filter and input images to reduce it to matrix multiplications. As shown in figure below. This approach achieves great performance (since matrix multiplication is well optimized), but incurs high memory overhead as each cell on the input image is duplicated multiple times. To compensate this, the matrix can be created during calculation: we can divide one matrix (corresponding to the input images) into multiple submatrices and multiply one submatrix with another and materialize the next submatrix at the same time.

N: batch size; C: number of (color) channels; H: image height; W: image width; K: number of filters; R: height of filters; S: width of filters; u: stride in vertical direction; v: stride in horizontal direction; pad_h: padding in horizontal direction; pad_w: padding in vertical direction. These are all tunable parameters for convolution in cuDNN

2. Using Fast Fourier Transform. FFT can significantly lower the work complexity of the convolutions, but it uses a significant amount of temporary memory since the filer needs to be padded to be the same size as the inputs.

3. Compute convolution directly. Note that this can be very efficient but requires a large number of specialized implementations to handle the many corner cases, making it hard to maintain. In addition, such implementations typically work well for some convolutions but not the others depending on the 11 parameters shown in the above figure.

The 3 implementations each perform better than the other 2 under different scenarios, with different parameters, different running environment (e.g., GPU memory available, compute power, etc), or something else. In practice, all 3 are implemented in cuDNN and is configurable through its parameters.


  • A library of compute primitives reduces the amount of efforts to maintain machine learning models and workloads: it does not require much code change to reoptimize the code on a different hardware.
  • This library makes developing new machine learning workloads easy, as they do not need to worry about the detail implementations of these arithmetic primitives.
  • cuDNN provides a huge parameter space. It provides flexibility.


  • The abstraction of cuDNN library restricts flexibility. The programmer has to perform computations within the framework defined by cuDNN.
  • cuDNN is a relatively general library: we may be able to write more optimized code specific to our model.

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, JohnTran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives fordeep learning.arXiv preprint arXiv:1410.0759(2014)

The GPU Computing Era


Graphics Processing Units (GPUs) are designed for parallel computing. Its initial purpose and its main driving force are the real-time graphics performance needed for render complex high-resolution 3D scenes at interactive frame rates for games. These workloads require huge amount of computation to render each pixel in a timely manner. Yet the work to calculate each pixel can be done in parallel and the are largely analogous.

At beginning, GPUs are exclusively for graphics rendering and uses programming interfaces like OpenGL. Early attempts to use GPU for computation requires writing the code with these graphics interface. The first general purpose GPU with CUDA cores was GeForce 8800 introduced in 2006. It has CUDA cores and is programmable in standard languages like C.

CUDA is a hardware and software coprocessing architecture for parallel computing. A compiled CUDA program can be executed on any size GPU, automatically scaling to the number of cores and threads. It is organized into a host program consisting sequential threads running on the host CPU, and parallel kernels suitable for execution on GPU. The programmer or the compiler organizes the threads into thread blocks, the threads in the same thread block will be placed close to each other so they can communicate and coordinate at a relatively low cost through a local shared memory. Each GPU can run multiple grids of thread blocks that can access a per-application global memory space.

A GPU consists of multiple components. Take Fermi GPU as an example (shown in figure below), it consists of multiple streaming multiprocessor (SM, collection of cores), a GigaThread responsible for scheduling different thread blocks to different SM, a host interface that connects to the host through PCIe, 6 DRAM interface accessing the GPU DRAM, and a L2 cache shared across the SMs.

SM (as shown in the figure below) employs the single-instruction multiple-thread (SIMT, or SIMD, single-instruction multiple-data) architecture. The SIMT instruction logic manages concurrent threads in groups of 32 parallel threads called wraps. A CUDA thread block comprises one or more wraps. This architecture allows the cores the be placed more compactly, but data dependent control flows within the same wrap can lead to divergence (different paths) and impact performance. Each SM also has a local shared memory and a L1 cache. Fermi GPU manages host memory, GPU DRAM, and SM’s local memory in a unified memory space.

Table below shows a comparison of clock speed of modern processing units. The compute speed of CPUs are much faster than GPUs ($5\times$). However, GPU cores are much more compact than CPU cores: a CPU core can take $50\times$ more area than a GPU core. As a result, CPUs are better for sequential execution and GPUs are better at parallel execution. This leads to a heterogeneous CPU+GPU processing system. This design delivers better performance than a homogeneous system in various workloads ranging from 0.5% sequential and 99.5% parallel workload to 75% sequential and 25% parallel workload.

Computing UnitClock Speed (MHz)
Intel Xeon Platinum2~4 GHz