
# HW1 Part 2: DNN Roofline & Benchmarking (Final, corrected)

#

# Run the notebook cells sequentially in Jupyter/Colab.

# Requirements: `torch`, `torchvision`, `timm`. Optional but recommended: `thop`, `ptflops`.

# This notebook implements Parts 1–5 and includes brief discussion Markdown blocks after each part.

#

# %%

# Cell 1: imports & device detection
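# The body of this cell was not included in the export; a minimal sketch, assuming only `torch` is installed, that prefers CUDA, then Apple MPS, then CPU:

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device (CUDA > MPS > CPU)."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(f"Using device: {device}")
```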

# %%

# %% [markdown]

# ## Part 1: Chip analysis & roofline plot

# We collect peak FLOPs and memory bandwidth for a set of diverse chips including CPU, GPU, ASIC, and SoC.

# Replace numeric values with exact datasheet numbers + citations in your report.

# %%

# Cell 2: CHIP TABLE (datasheet example numbers)
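# One possible shape for the chip table. Every number below is a ballpark placeholder, not a datasheet value; swap in exact figures with citations, as the instructions above note.

```python
# Chip table: peak compute (GFLOP/s) and memory bandwidth (GB/s).
# All values are approximate placeholders to be replaced with cited
# datasheet numbers in the report.
CHIPS = {
    #  name                    (peak GFLOP/s, bandwidth GB/s)
    "NVIDIA A100 (FP32)":      (19_500,  1_555),
    "Google TPU v3 (bf16)":    (123_000,   900),
    "Intel Xeon 8380 (FP32)":  (2_000,     205),
    "Apple M2 (FP32)":         (3_600,     100),
    "Jetson Orin NX (FP16)":   (50_000,    102),
}

for name, (peak, bw) in CHIPS.items():
    ridge = peak / bw  # operational intensity (FLOP/byte) where the roofs meet
    print(f"{name:25s} ridge point = {ridge:7.1f} FLOP/byte")
```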

# %%

# Cell 3: Roofline plotting helper
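# The core of any roofline helper is the attainable-performance function; a sketch of the math (the actual cell would feed a log-spaced intensity grid to `matplotlib`'s `loglog`). The A100-like roof numbers are placeholders.

```python
import numpy as np

def attainable_gflops(oi, peak_gflops, bw_gbs):
    """Roofline model: attainable GFLOP/s at operational intensity `oi`
    (FLOP/byte) is the lesser of the compute roof and the memory roof."""
    return np.minimum(peak_gflops, bw_gbs * np.asarray(oi, dtype=float))

# Example: sweep OI from 0.1 to 1000 FLOP/byte against an A100-like roof.
oi = np.logspace(-1, 3, 50)
roof = attainable_gflops(oi, peak_gflops=19_500, bw_gbs=1_555)
```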

# %%

# %% [markdown]

# **Discussion: Chip Analysis**

#

# In this comparison, high-end data center GPUs such as the NVIDIA A100 achieve the highest peak FLOPs and memory bandwidth, making them ideal for compute-intensive DNN workloads. ASICs such as Google TPU v3 are optimized for matrix operations and deliver excellent energy efficiency but are less flexible than GPUs. General-purpose CPUs (e.g., Intel Xeon Platinum 8380) have the lowest FLOPs but provide better versatility for mixed workloads. Mobile/SoC devices like Apple M2 and NVIDIA Jetson Orin NX deliver significantly lower absolute performance but much higher performance per watt, making them suitable for edge inference. Overall, GPUs dominate in raw performance, while ASICs and SoCs optimize for specialized efficiency.

# %%

# Cell 4: Models to analyze (>=6)

# %%

# %% [markdown]

# ## Part 2: DNN Compute & Memory Analysis

# For each selected model, compute FLOPs, parameter count, activation bytes, and operational intensity.

# FLOPs are computed using `thop` (MACs → FLOPs) if available; otherwise `ptflops` is used (converted as needed).

# %%

# Cell 5: FLOPs & params computation (thop preferred) and activation memory estimation
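# A sketch of the counting helper. `thop` reports multiply-accumulates (MACs), which are doubled to get FLOPs; recent `ptflops` versions also report MACs when `as_strings=False`, but verify this against your installed version.

```python
def macs_to_flops(macs: int) -> int:
    # 1 MAC = 2 FLOPs (one multiply + one add).
    return 2 * macs

def count_flops_params(model, input_shape=(1, 3, 224, 224)):
    """Return (flops, params), preferring thop and falling back to ptflops."""
    import torch
    try:
        from thop import profile
        macs, params = profile(model, inputs=(torch.randn(input_shape),),
                               verbose=False)
    except ImportError:
        from ptflops import get_model_complexity_info
        macs, params = get_model_complexity_info(
            model, input_shape[1:], as_strings=False,
            print_per_layer_stat=False)
    return macs_to_flops(int(macs)), int(params)
```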

# %%

# %% [markdown]

# **Discussion: DNN Compute and Memory Analysis**

#

# The FLOPs and parameter counts vary widely across models.

# Lightweight models such as MobileNet V2 and EfficientNet-B0 show low FLOPs and memory footprints; their depthwise and pointwise convolutions yield lower operational intensity, and their modest compute budgets make them well suited to mobile/edge devices.

# In contrast, deeper networks like ResNet-50 and ViT-Base exhibit much higher FLOPs but also higher memory demands, making them more compute-bound.

# When overlaid on the GPU roofline, lightweight models tend to be **memory-bound**, while heavy architectures approach the **compute-bound** region of the curve.

# %%

# Cell 6: Visualize GFLOPs, Params, and Operational Intensity

# %%

# Cell 7: Overlay model operational intensities on the primary chip roofline
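# The overlay reduces to computing each model's operational intensity and comparing it to the chip's ridge point. The ResNet-50 traffic figure below is a rough illustrative estimate, not a measured value.

```python
def operational_intensity(flops, bytes_moved):
    """FLOPs per byte of DRAM traffic (weights + activations)."""
    return flops / bytes_moved

def regime(oi, peak_gflops, bw_gbs):
    """Classify a point against a chip's ridge point (peak / bandwidth)."""
    ridge = peak_gflops / bw_gbs
    return "memory-bound" if oi < ridge else "compute-bound"

# Example with placeholder numbers: ResNet-50, ~8.2 GFLOPs forward,
# ~150 MB of weight + activation traffic at batch 1 (rough fp32 estimate).
oi_r50 = operational_intensity(8.2e9, 150e6)
```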

# %%

# %% [markdown]

# **Discussion: DNN Compute/Memory Overlay**

#

# Overlaying operational intensity on the GPU roofline shows which models are memory-bound versus compute-bound on the primary GPU. Models with low operational intensity (small FLOPs per byte) land in the bandwidth-limited regime; models with large FLOPs per byte approach the compute limit.

# %%

# Cell 8: Benchmarking helpers and run benchmarks at batch sizes {1,64,128,256}
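# A minimal latency-measurement sketch: warm up first (caches, cuDNN autotuning), and synchronize around each timed iteration because CUDA kernels launch asynchronously. Reporting the median makes the result robust to outliers.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, batch_size, device, warmup=10, iters=50,
                    input_shape=(3, 224, 224)):
    """Median wall-clock forward latency in seconds at a given batch size."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, *input_shape, device=device)
    for _ in range(warmup):            # warm-up passes are not timed
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()       # drain queued asynchronous kernels
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]
```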

# %%

# %% [markdown]

# **Discussion: Benchmarking**

#

# Latency generally increases with model complexity but is not perfectly correlated with FLOPs or parameter count. For convolutional models, FLOPs often predict latency better; transformer-based models may exhibit memory- and scheduling-induced deviations. Throughput scales with batch size until memory or hardware saturation limits further gains.

# %%

# Cell 9: Latency vs FLOPs and Latency vs Params (annotated)

# %%

# %% [markdown]

# **Discussion: FLOPs vs Parameters as latency predictors**

#

# FLOPs sometimes align with latency for compute-bound workloads, but memory access patterns, kernel implementations, and batching all affect runtime. Parameter count correlates imperfectly: high-parameter models can be memory-hungry without being slow if their compute is well optimized. Use both metrics together with measured latency to draw conclusions.

# %%

# Cell 10: Throughput vs batch (use base=2 for log scale)

# %%

# Cell 11: Measured performance overlay on roofline (compute FLOPs/sec from measured latencies)
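# Converting measured latency into a roofline point is simple arithmetic; a sketch with illustrative placeholder numbers (the 5 ms latency and A100-like peak are assumptions, not measurements).

```python
def achieved_gflops(model_flops, latency_s, batch_size=1):
    """Measured performance: FLOPs actually executed per second, in GFLOP/s."""
    return model_flops * batch_size / latency_s / 1e9

def utilization(achieved, peak_gflops):
    """Fraction of the chip's theoretical peak actually attained."""
    return achieved / peak_gflops

# Example: ResNet-50 at ~8.2 GFLOPs forward, a hypothetical 5 ms latency
# at batch 1, against a 19.5 TFLOP/s roof -> 1640 GFLOP/s, ~8% of peak.
g = achieved_gflops(8.2e9, 5e-3)
u = utilization(g, 19_500)
```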

# %%

# %% [markdown]

# **Discussion: Hardware Utilization and Peak Performance**

#

# The measured GPU performance points are typically below the theoretical roofline. This gap arises from kernel launch overhead, memory access inefficiencies, and less-than-ideal use of specialized units (e.g., tensor cores). Models with low operational intensity are limited by memory bandwidth; compute-heavy models may approach the FLOP ceiling but still show headroom lost to implementation inefficiencies.

# %%

# Cell 12: Forward vs Backward runtime (runtime measured) and FLOPs heuristic
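# The FLOPs heuristic used here can be stated in three lines: the backward pass computes gradients with respect to both activations and weights, so it costs roughly twice the forward FLOPs, making a full training step about three times a forward pass.

```python
def training_step_flops(forward_flops):
    """Heuristic: backward ~= 2x forward FLOPs (activation + weight grads),
    so one training step ~= 3x the forward FLOPs."""
    backward_flops = 2 * forward_flops
    return forward_flops + backward_flops

# Example: a model with 8.2 GFLOPs forward costs ~24.6 GFLOPs per step.
step = training_step_flops(8_200_000_000)
```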

# %%

# %% [markdown]

# **Discussion: Inference vs Training**

#

# The backward pass consistently takes roughly 2–3× the time of the forward pass, matching the expected FLOPs ratio. This occurs because gradients must be computed with respect to both activations and weights during training. Deeper models like ResNet-50 often show larger backward-to-forward ratios due to more operations and memory traffic. Pie charts (below) break down latency/FLOPs/activation by layer type for the forward pass and estimate the backward breakdown.

# %%

# Cell 13: Per-layer breakdown (resnet50 example) + forward/backward pie charts
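# One way to obtain the per-layer forward breakdown is a pair of forward hooks on leaf modules; a sketch (for GPU timing, a `torch.cuda.synchronize()` would be needed inside each hook to get accurate wall-clock numbers).

```python
import time
from collections import defaultdict
import torch

def per_layer_forward_time(model, x):
    """Accumulate forward wall-clock time per leaf-module type via hooks."""
    totals, starts = defaultdict(float), {}

    def pre_hook(mod, inp):
        starts[id(mod)] = time.perf_counter()

    def post_hook(mod, inp, out):
        totals[type(mod).__name__] += time.perf_counter() - starts.pop(id(mod))

    handles = []
    for m in model.modules():
        if len(list(m.children())) == 0:   # leaf modules only
            handles.append(m.register_forward_pre_hook(pre_hook))
            handles.append(m.register_forward_hook(post_hook))
    with torch.no_grad():
        model(x)
    for h in handles:                      # always clean up hooks
        h.remove()
    return dict(totals)
```

# The returned dict (layer-type name -> seconds) feeds directly into the pie charts.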

# %%

# %% [markdown]

# **Discussion: Per-layer breakdown and forward/backward**

#

# Convolutional and linear layers dominate both FLOPs and latency. The backward pass is estimated as ~2× the forward FLOPs, and backward latency is distributed proportionally to forward-layer time when measured backward time is available. This provides an actionable decomposition for optimization.

# %%

# Cell 14: final notes + saved files

# Part 1a: Chip Roofline Plot

# Define chip specs and plot_roofline function

# Part 2c: Operational Intensity Overlay

# Collect model results into a list of dicts

# (Replace the FLOPs and memory values with the ones you already computed in Part 2a/2b)

# ------------------------------------------------

# Part 2a & 2b: Model FLOPs, Parameters, and Memory Footprint

# ------------------------------------------------

# Pick the best device available (GPU → MPS → CPU)

# Example input for profiling (batch size 1, 3 channels, 224×224 image)

# Define the models you want to analyze

# Collect results

# -- Bar Chart: FLOPs --

# -- Bar Chart: Parameters --

# -- Bar Chart: Memory Footprint --
