屋顶线分析简介¶
A Brief Intro to Roofline Analysis
📖 章节信息¶
- 所属部分:预备知识
- 原文章节:A Brief Intro to Roofline Analysis
- 翻译时间:2026年03月30日
- 原文地址:https://jax-ml.github.io/scaling-book/roofline
🎯 学习目标¶
通过本章学习,您将了解:
- 核心概念:掌握屋顶线分析简介的基本原理
- 实践应用:了解在实际项目中的应用方法
- 技术细节:深入理解相关的技术实现
- 最佳实践:学习行业内的最佳实践方法
翻译说明: - 本文为《A Brief Intro to Roofline Analysis》的中文翻译 - 原文档地址:https://jax-ml.github.io/scaling-book/a-brief-intro-to-roofline-analysis - 翻译时间:2026年03月30日 - 翻译状态:初步翻译,正在完善中
注意事项: 1. 技术术语尽量保持原意 2. 复杂概念添加中文解释 3. 公式和代码保持原样 4. 图表引用原文档
All About Rooflines | How To Scale Your Model * How To Scale Your Model Toggle navigation * Previous Part * Next Part * Sections Part 0. Introduction Part 1. Intro to Rooflines Part 2. All About TPUs Part 3. Sharded Matmuls Part 4. Transformers Part 5. Training Part 6. Training LLaMA Part 7. Inference Part 8. Serving LLaMA Part 9. Profiling Part 10. All About JAX Part 11. Conclusions Part 12. GPUs
*
# All About Rooflines Part 1 of [How To Scale Your Model](/scaling-book) ([Part 0: Introduction](..) | [Part 2: TPUs](../tpus))
When we run algorithms on hardware, we're bounded by three things: how fast our computer can do math (OPs/second), the 带宽 available for moving data around (bytes/second), and the total 内存 available to store data (bytes). These “roofline” constraints let us upper and lower bound the time of a given 计算.
### Contents [Where Does the Time Go?](#where-does-the-time-go) [](#) * [Visualizing rooflines](#visualizing-rooflines)
A Few Problems to Work ## Where Does the Time Go? Let’s start with an extremely simple question: why does an algorithm take 50ms instead of 50s or 5ms? What is actually happening within the model that takes substantial time and how long should we expect it to take?
Computation: A deep learning model is effectively a bunch of 矩阵 multiplications, each composed of floating-point 乘法 and addition ‘operations’ (FLOPs). Our accelerator speed determines how long these take to compute:
[\begin{equation} T_\text{math} = \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}} \end{equation}] For instance, an NVIDIA H100 can perform about 9.89e14 bfloat16bf16 is short for bfloat16, a 16-bit floating point format often used in ML. FLOPs/s while a TPU v6e can perform 9.1e14 FLOPs/s.H100s and B200s can usually only achieve around 80-85% of the claimed peak FLOPs, while TPUs can get closer to 95% in normal use. That means doing 1e12 FLOPs on an H100 will take (roughly) 1e12 / 9.89e14 = 1.01ms and 1e12 / 9.1e14 = 1.1ms on a TPU v6e.Note that these chips are priced differently, and this comparison does not normalize to cost.
Communication within a chip: Within an accelerator, tensors need to be transferred between accelerator 内存 (HBM) and the compute cores. You’ll see the 带宽 of this link referred to as “HBM bandwidth”NVIDIA also calls this "memory 带宽." On an H100, this is about 3.35TB/s and on TPU v6e this is about 1.6TB/s.
Communication between chips: When we distribute a model across multiple accelerators, tensors frequently need to be transferred between them. There are often a few options for this on our hardware (ICI, DCN, and PCIe), each with different bandwidths.
Whether the 通信 is within a chip or between chips, we measure this in bytes/s and estimate the total 通信 time with:
[\begin{equation} T_\text{comms} = \frac{\text{Communication Bytes}}{\text{Network/Memory Bandwidth Bytes/s}} \end{equation}] Typically (but not always), 计算 within a single chip can be overlapped with 通信 within a chip and between chips. This means we can lower-bound 训练 and 推理 time by using the maximum of 计算 and 通信 time. We can also upper-bound with their sum. In practice, we optimize against the maximum as the algebra is simpler and we can usually come close to this bound by overlapping our 通信 and 计算. If we optimize with the maximum in mind then the lower and upper bounds differ by at most a factor of 2 since \(T_\text{math} + T_\text{comms} \leq 2 * \max(T_\text{math}, T_\text{comms})\). We then increase accuracy beyond this by modeling ‘overlap regions’ and overheads, which can be informed by profiling your specific model and target system.
[\begin{equation} T_\text{lower}=\max(T_\text{math}, T_\text{comms}) \end{equation}] [\begin{equation} T_\text{upper} = T_\text{math} + T_\text{comms} \end{equation}] If we assume we can perfectly overlap 通信 and 计算, when \(T_\text{math} > T_\text{comms}\), we see full utilization from our hardware. We call this being “compute-bound”. When \(T_\text{comms} > T_\text{math}\), we tend to be “communication-bound” and at least some fraction of our accelerator FLOPs/s is wasted waiting for data to be passed around. One way to tell if an operation will be compute or communication-bound is to look at its “arithmetic intensity” or “operational intensity”.
Definition: the arithmetic intensity of an algorithm is given by the ratio of the total FLOPs it performs to the number of bytes it needs to communicate — either within a chip or between chips.
[\begin{equation} \text{Arithmetic Intensity} = \frac{\text{Computation FLOPs}}{\text{Communication Bytes}} \end{equation}] Arithmetic intensity measures the “FLOPs per byte” of a given operation. To a first order, when our arithmetic intensity is high, \(T_\text{math}\) is large compared to \(T_\text{comms}\) and we typically use most of the available FLOPs. When the opposite is true, we spend more time on comms and waste FLOPs. The point where this crossover happens is the “peak arithmetic intensity” of our hardware, the ratio of peak accelerator FLOPs/s to accelerator 带宽.
[\begin{align} T_\text{math} > T_\text{comms} \Leftrightarrow \frac{\text{Computation FLOPs}} {\text{Accelerator FLOPs/s}} > \frac{\text{Communication Bytes}}{\text{Bandwidth Bytes/s}} & \[0.5em] \Leftrightarrow \frac{\text{Computation FLOPs}}{\text{Communication Bytes}} > \frac{\text{Accelerator FLOPs/s}}{\text{Bandwidth Bytes/s}} & \[0.5em] \Leftrightarrow \text{Intensity}(\text{Computation}) > \text{Intensity}(\text{Accelerator}) & \ \end{align}] The quantity \(\text{Intensity}(\text{Accelerator})\) is the arithmetic intensity at which our accelerator achieves its peak FLOPs/s. For the TPU v5e MXU, this is about 240 FLOPs/byte, since the TPU can perform 1.97e14 FLOPs/s and load 8.2e11 bytes/s from HBM.The MXU is the 矩阵 multiply unit on the TPU. We specify this here because the TPU has other accelerators like the VPU that are responsible for elementwise operations that have a different peak FLOPs/s. That means if an algorithm has a lower arithmetic intensity than 240 FLOPs/byte, it will be bound by byte loading and thus we won’t make good use of our hardware.This is only true if the algorithm loads its weights from HBM and runs in the MXU. As we'll discuss in the next section, we can sometimes store parameters in VMEM which has a much higher 带宽. Many algorithms also run in the VPU, which has different performance characteristics. Let’s look at one such example:
Example (dot product): to compute the dot product of two vectors in bfloat16 precision, x • y: bf16[N], bf16[N] → bf16[1], we need to load \(x\) and \(y\) from 内存, each of which has \(2 * N = 2N\) bytes, perform \(N\) multiplications and \(N-1\) additions, and write \(2\) bytes back into HBM \(\begin{equation} \text{Intensity}(\text{dot product}) = \frac{\text{Total FLOPs}}{\text{Total Bytes}} = \frac{N + N - 1}{2N + 2N + 2} = \frac{2N - 1}{4N + 2} \rightarrow \frac{1}{2} \end{equation}\)
as \(N\rightarrow\infty\). So the dot product has an arithmetic intensity of \(\frac{1}{2}\) or, put another way, the dot product does 0.5 floating point operations per byte loaded. This means our arithmetic intensity is lower than that of our hardware and we will be communication-bound.The 240 number above is not the correct comparison here since, as you will see in the next section, a dot-product is performed on the VPU and not the MXU. The TPU v5p VPU can do roughly 7e12 FLOPs / second, so its critical intensity is around 3, which means we are still somewhat comms-bound here. Either way, the fact that our intensity is low and constant means it is difficult to be compute-bound on most hardware.
### Visualizing rooflines We can visualize the tradeoff between 内存 and compute using a roofline plot, which plots the peak achievable FLOPs/s (throughput) of an algorithm on our hardware (the y-axis) against the arithmetic intensity of that algorithm (the x-axis). Here’s an example log-log plot:
Figure: an example roofline plot showing two algorithms with different arithmetic intensities (Algo 1 and Algo 2) and their corresponding theoretical peak throughput under different bandwidths (BW1 and BW2). In the red area, an algorithm is 带宽 bound at both bandwidths and is wasting some fraction of the hardware's peak FLOPs/s. The yellow area is bandwidth-bound only at the lower 带宽 (BW1). The green area is compute-bound at all bandwidths. Here, we are using the peak FLOPs/s of the accelerator and increasing 带宽 or improving intensity yield no benefit. Above, as the intensity increases (moving left to right), we initially see a linear increase in the performance of our algorithm (in FLOPs/s) until we hit the critical arithmetic intensity of the hardware, 240 in the case of the TPU v5e. Any algorithm with a lower intensity will be 带宽 (BW) bound and limited by the peak 内存 带宽 (shown in red). Any algorithm to the right will fully utilize our FLOPs (shown in green). Here, Algo 1 is comms-bound and uses only a fraction of the total hardware FLOPs/s. Algo 2 is compute-bound. We can generally improve the performance of an algorithm either by increasing its arithmetic intensity or by increasing the 内存 带宽 available (moving from BW1 to BW2).
### Matrix 乘法 Let’s look at our soon-to-be favorite algorithm: 矩阵 乘法 (aka matmul). We write \(X * Y \rightarrow Z\) where \(X\) has shape \(\text{bf16}[B, D]\), \(Y\) has shape \(\text{bf16}[D, F]\), and \(Z\) has shape \(\text{bf16}[B, F]\). To do the matmul we need to load \(2DF + 2BD\) bytes, perform \(2BDF\) FLOPs, and write \(2BF\) bytes back.Technically we perform \(BF \times (2D - 1)\) FLOPs but this is close enough. This comes from \(BDF\) multiplications and \(BF * (D-1)\) additions. Section 4 has more details. Although the output of a matmul is technically float32 we usually cast down to bfloat16 before copying back to HBM. Thus:
[\begin{equation} \text{Intensity}(\text{matmul}) = \frac{2BDF}{2BD + 2DF + 2BF} = \frac{BDF}{BD + DF + BF} \end{equation}] We can get a nice simplification if we assume our “batch size” \(B\) is small relative to \(D\) and \(F\). Then we get
[\begin{equation} \frac{BDF}{BD + DF + BF} \approx \frac{BDF}{DF} = B \end{equation}] [\begin{equation} \text{Intensity}(\text{matmul}) > \text{Intensity}(\text{TPU}) \implies B > \frac{1.97e14}{8.20e11} = 240 \end{equation}] This is a reasonable assumption for Transformer matmuls since we typically have a local (per-replica) batch size \(B < 1024\) tokens (not sequences) but \(D\) and \(F > 8000\). Thus we generally become compute-bound when our per-replicaWe say per-replica because, if we do some kind of model sharding to increase the number of chips used in the matmul, we scale both our available compute and 内存 带宽 by the same amount. Thus the critical batch size is true per independent copy of the model weights. batch size is greater than 240 tokens, a very simple rule!
Takeaway: for a bfloat16 matmul to be compute-bound on most TPUs, we need our per-replica token batch size to be greater than 240.Note that this is not the batch size in the usual sense, where it means the batch size in sequences. It turns out most rooflines depend purely on the number of tokens, whether they belong to the same or different sequences. For instance if you have a batch size of 512 sequences of 4096 tokens on 128 GPUs, you have a total batch size of 512 * 4096 = 2M tokens, and a local batch size of 16k tokens.
This comes with a few notable caveats we’ll explore in the problems below, particularly with respect to quantization (e.g., if we quantize our activations but still do full-precision FLOPs), but it’s a good rule to remember. For GPUs, this number is slightly higher (closer to 300), but the same conclusion generally holds. When we decompose a big matmul into smaller matmuls, the tile sizes also matter.When we do a large 矩阵 乘法, we need to break it down into smaller tiles which fit into VMEM/SMEM/TMEM, the higher-bandwidth on-chip 内存. This causes us to load chunks multiple times, so it's no longer quite true that we only load \(O(N^2)\) bytes. Consider an \((m, k) \cdot (k, n)\) matmul with tile sizes \(bm\), \(bk\), \(bn\). Let \(tm = m / bm\), etc. Then the total FLOPs is \(2 \cdot tm \cdot tn \cdot tk \cdot bm \cdot bn \cdot bk\) and the total bytes are \(2 \cdot tm \cdot tn \cdot (tk \cdot (bm \cdot bk + bk \cdot bn) + 2 \cdot bm \cdot bn)\). Ignoring the last term, we have an intensity of \(bm \cdot bn / (bm + bn)\), which is similar to the above. We’ll discuss the lower-level GPU and TPU details in the next section.
### Network 通信 rooflines All the rooflines we’ve discussed so far have been memory-bandwidth rooflines, all within a single chip. This shouldn’t be taken as a rule. In fact, most of the rooflines we’ll care about in this book involve 通信 between chips: usually 矩阵 multiplications that involve matrices sharded across multiple TPUs.
To pick a somewhat contrived example, say we want to multiply two big matrices \(X\sim \text{bfloat16[B, D]}\) and \(Y \sim \text{bfloat16[D, F]}\) which are split evenly across 2 TPUs/GPUs (along the \(D\) dimension). To do this 乘法 (as we’ll see in Section 3), we can multiply half of each 矩阵 on each TPU (A = X[:, :D // 2] @ Y[:D // 2, :] on TPU 0 and B = X[:, D // 2:] @ Y[D // 2:, :] on TPU 1) and then copy the resulting “partial sums” to the other TPU and add them together. Say we can copy 4.5e10 bytes/s in each direction and perform 1.97e14 FLOPs/s on each chip. What are \(T_\text{math}\) and \(T_\text{comms}\)?
\(T_\text{math}\) is clearly half of what it was before, since each TPU is doing half the work, i.e.We're ignoring the FLOPs required to add the two partial sums together (another BF addition), but this is basically negligible.
[T_\text{math} = \frac{2BDF}{2 \cdot \text{Accelerator FLOPs/s}} = \frac{BDF}{1.97e14}] Now what about \(T_\text{comms}\)? This now refers to the 通信 time between chips! This is just the total bytes sent divided by the network 带宽, i.e.
[T_\text{comms} = \frac{2BF}{\text{Network Bandwidth}} = \frac{2BF}{4.5e10}] Therefore we become compute-bound (now with respect to the inter-chip network) when \(\text{Intensity}(\text{matmul (2-chips)}) > \text{Intensity}(\text{TPU w.r.t. inter-chip network})\) or equivalently when \(\frac{BDF}{2BF} = \frac{D}{2} > \frac{1.97e14}{4.5e10} = 4377\) or \(D > 8755\). Note that, unlike before, the critical threshold now depends on \(D\) and not \(B\)! Try to think why that is. This is just one such example, but we highlight that this kind of roofline is critical to knowing when we can parallelize an operation across multiple TPUs.
## A Few Problems to Work Question 1 [int8 matmul]: Say we want to do the matmul \(X[B, D] \cdot_D Y[D, F] \rightarrow Z[B, F]\)Here and throughout we'll use the notation \(A \cdot_D B\) to indicate that the 乘法 is performing a contraction over the D dimension. This is an abuse of einsum notation. in int8 precision (1 byte per parameter) instead of bfloat16 (2 bytes per parameter) since TPUs/GPUs can do matmuls faster in lower precision.
- How many bytes need to be loaded from memory? How many need to be written back to memory?
- How many total OPs are performed?
- What is the arithmetic intensity?
- What is a roofline estimate for \(T_\text{math}\) and \(T_\text{comms}\)? What are reasonable upper and lower bounds for the runtime of the whole operation?
Assume our HBM 带宽 is 8.1e11 bytes/s and our int8 peak OPs/s is 3.94e14 (about 2x bfloat16).
Click here for the answer. * Because we’re storing our parameters in int8, we have 1 byte per parameter, so we have \(BD + DF\) bytes loaded from HBM and \(BF\) written back. * This is the same as in bfloat16, but in theory int8 OPs/s should be faster. So this is still \(2BDF\) FLOPs. * Arithmetic intensity is \(2BDF / (BD + DF + BF)\). If we make the same assumption as above about \(B \ll D\) and \(B \ll F\), we get an arithmetic intensity of \(2B\), meaning our rule becomes \(B > \text{HBM int8 arithmetic intensity} / 2\). Using the numbers given, this int8 intensity is 3.94e14 / 8.1e11 = 486, so the rule is \(B > 486 / 2 = 243\). Note that this is basically unchanged! * \(T_\text{math} = 2BDF / 3.94e14\) and \(T_\text{comms} = (BD + DF + BF) / 8.1e11\), so a reasonable lower bound is \(\max(T_\text{math}, T_\text{comms})\) and an upper bound is \(T_\text{math} + T_\text{comms}\).
Question 2 [int8 + bf16 matmul]: In practice we often do different weight vs. activation quantization, so we might store our weights in very low precision but keep activations (and compute) in a higher precision. Say we want to quantize our weights in int8 but keep activations (and compute) in bfloat16. At what batch size do we become compute bound? Assume 1.97e14 bfloat16 FLOPs/s.
Hint: this means specifically bfloat16[B, D] * int8[D, F] -> bfloat16[B, F] where \(B\) is the “batch size”.
Click here for the answer. Again assuming B is small, we have 2BDF bfloat16 FLOPs but only DF weights (instead of 2DF in bfloat16). This means we become compute-bound when \(2B > 240\) or \(B > 120\). This is a lot lower, meaning if we can do int8 weight quantization (which is fairly easy to do) but still do bfloat16 FLOPs, we get a meaningful win in efficiency (although int8 OPs would be better).
Question 3: Taking the setup from Question 2, make a roofline plot of peak FLOPs/s vs. \(B\) for \(F = D = 4096\) and \(F = D = 1024\). Use the exact number of bytes loaded, not an approximation.
Click here for the answer. Here is the plot in question:
Note that both models eventually achieve the peak hardware FLOPs/s, but the larger D/F achieve it sooner. D=F=1024 almost doubles the critical batch size. The code to generate this figure is here:
``` import matplotlib.pyplot as plt import numpy as np
bs = np.arange(1, 512)
def roofline(B, D, F): total_flops = 2BDF flops_time = total_flops / 1.97e14 comms_time = (2BD + DF + 2BF) / 8.2e11 total_time = np.maximum(flops_time, comms_time) return total_flops / total_time
roofline_big = roofline(bs, 4096, 4096) roofline_small = roofline(bs, 1024, 1024)
plt.figure(figsize=(8, 4)) plt.plot(bs, roofline_big, label='F=D=4096') plt.plot(bs, roofline_small, label='F=D=1024') plt.legend() plt.xlabel('batch size') plt.ylabel('peak bfloat16 FLOPs/s on TPU v5e') plt.grid()
``` Question 4: What if we wanted to perform \(\text{int8[B, D]} *_D \text{int8[B, D, F]} \rightarrow \text{int8[B, F]}\) where we imagine having a different 矩阵 for each batch element. What is the arithmetic intensity of this operation?
Click here for the answer. Let’s start by looking at the total FLOPs and comms.
- Total FLOPs: the FLOPs is basically the same, since we’re doing the same number of \(BD \times DF\) matmuls (this is discussed more in section 4). So this is just \(2BDF\).
- Total comms: we have a lot more comms here: \(BD + BDF + BF\).
- Therefore, our arithmetic intensity is now actually \(2BDF / (BD + BDF + BF)\). Since \(BDF\) dominates the denominator, this is roughly \(2\). So instead of it depending on the batch size, this is essentially constant. This is bad because it means we’ll basically always be comms bound no matter what.
Problem 5 [Memory Rooflines for GPUs]: Using the spec sheet provided by NVIDIA for the H100 SXM, calculate the batch size at which a bfloat16 矩阵 乘法 will become compute-bound. Note that the Tensor Core FLOPs numbers are twice the true value since they’re only achievable with structured sparsity.
Click here for the answer. From the spec sheet, we see that the reported bfloat16 FLOPs value is 1.979e15 FLOPs/s with an asterisk noting “with sparsity”. The true value is half this without sparsity, meaning close to 1e15 FLOPs/s. The 内存 带宽 is 3.35TB/s, or 3.35e12 bytes / second. Thus \(B_\text{crit}\) is 1e15 / 3.35e12 = 298, rather similar to the TPU.
### That’s it for Part 1! For Part 2, looking at how real TPUs handle FLOPs and 通信, click here. ### Miscellaneous *Work done at Google DeepMind, now at MatX.
### Citation For attribution in academic contexts, please cite this work as:
``` Austin et al., "How to Scale Your Model", Google DeepMind, online, 2025.
``` or as a BibTeX entry:
``` @article{scaling-book, title = {How to Scale Your Model}, author = {Austin, Jacob and Douglas, Sholto and Frostig, Roy and Levskaya, Anselm and Chen, Charlie and Vikram, Sharad and Lebron, Federico and Choy, Peter and Ramasesh, Vinay and Webson, Albert and Pope, Reiner}, publisher = {Google DeepMind}, howpublished = {Online}, note = {Retrieved from https://jax-ml.github.io/scaling-book/}, year = {2025} }
``` Please enable JavaScript to view the comments powered by giscus. © Copyright 2026 . Powered by Jekyll with al-folio theme. Hosted by GitHub Pages.
📚 延伸阅读¶
💡 学习建议¶
- 实践结合:在学习理论的同时进行代码实践
- 循序渐进:按照章节顺序逐步深入学习
- 问题驱动:带着实际问题学习效果更好
- 社区交流:参与相关技术社区讨论
本翻译由OpenClaw自动生成,正在不断完善中。 如有翻译问题或建议,欢迎通过博客反馈渠道联系。