# Real-world Implementation and Future of AI Acceleration Systems based on Processing-in-Memory

Byeongho Kim, Ph.D. Samsung Electronics

# Outline

- Generative AI and Memory Requirement
- Memory Solutions for AI
  - New Memory Hierarchy
- Processing-in-Memory
  - PIM and PNM
  - HBM-PIM and other commercial PIM technology
  - Challenges on commercializing PIM
  - Next: LPDDR-PIM
- Summary

# Large-scale AI and Memory Wall

- New applications have already reached the memory wall
  - AI applications are bottlenecked by communication overhead rather than compute.
  - The scaling rate of AI models (FLOPs and parameters) far exceeds that of memory bandwidth and capacity.



\*https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8

# **GPT** Characteristics

- Transformer decoder-based structure and two phases
  - Generation stage dominates the execution time.



$\star$: Layer that includes GEMV in the generation stage

\*J. Choi, Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models, IEEE CAL, 2023

# **GPT** Characteristics

- GEMV can account for 60–80% of total generation latency



\*Profiling results were measured on an A100 system (FasterTransformer + GPT-J, FP16, input/output tokens: 32/32). The "J" in GPT-J refers to the Google JAX framework.
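The 60–80% share bounds how much end-to-end benefit accelerating GEMV alone can deliver. A minimal sketch using Amdahl's law follows; the function name and the GEMV speedup values are illustrative assumptions, not results from the profiling above.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only a fraction of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# GEMV shares taken from the slide; GEMV speedups are hypothetical.
for gemv_fraction in (0.6, 0.8):
    for kernel_speedup in (2.0, 4.0, float("inf")):
        total = amdahl_speedup(gemv_fraction, kernel_speedup)
        print(f"GEMV share {gemv_fraction:.0%}, GEMV speedup {kernel_speedup}: "
              f"end-to-end {total:.2f}x")
```

Even with an infinitely fast GEMV, the non-GEMV layers cap the overall gain, which is why the GEMV share matters as much as the kernel speedup itself.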

# **GPT H/W Utilization Breakdown**

- As the number of tokens increases, GEMV dominates inference time
- Maximum utilization on GEMV is limited by memory bandwidth



# Memory Solution for GPT

- AI needs higher-bandwidth memory
  - Paradigm shift in AI: CNN $\rightarrow$ LLM (GPT)



Figure axis: arithmetic intensity (FLOPs/byte)
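Arithmetic intensity explains why GEMV is bandwidth-bound while batched GEMM is not. The sketch below applies a simple roofline model; `PEAK_FLOPS` and `MEM_BANDWIDTH` are hypothetical round numbers chosen for illustration, not the specification of any particular device, and the matrix size is likewise assumed.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) product, each operand touched once."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: limited by compute or by bandwidth times intensity."""
    return min(peak_flops, intensity * mem_bandwidth)


PEAK_FLOPS = 300e12     # hypothetical peak FP16 throughput, FLOP/s
MEM_BANDWIDTH = 1.5e12  # hypothetical memory bandwidth, bytes/s

for batch in (1, 8, 64, 512):  # batch = 1 is the GEMV case
    ai = arithmetic_intensity(4096, 4096, batch)
    utilization = attainable_flops(ai, PEAK_FLOPS, MEM_BANDWIDTH) / PEAK_FLOPS
    print(f"batch {batch:4d}: {ai:7.2f} FLOPs/byte, max utilization {utilization:.1%}")
```

With FP16 operands, GEMV sits at roughly 1 FLOP/byte, so under these assumed numbers the compute units can be kept busy well under 1% of the time no matter how the kernel is tuned.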

# High Bandwidth Memory Solutions for AI

- HBM provides the highest bandwidth on the market
  - Various memory solutions can be used depending on the system



- Memories shown in yellow are newly proposed.
- \* LHM: Low-power High-bandwidth Memory (e.g., LLW)
- \* CMM: CXL Memory Module
- Underlined products are currently commercially available.

# Besides Bandwidth, Energy also Matters

- Another limitation of the von Neumann architecture
  - DRAM consumes a large amount of energy to transfer data

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

Relative Energy Cost

Source: Computing's Energy Problem (and what we can do about it), ISSCC'14
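The table can be turned into a quick estimate of how much of a multiply-accumulate's energy goes to fetching its operand. This is a rough sketch: it assumes an FP16 MAC costs one 16b FP multiply plus one 16b FP add, and that a 32b DRAM read delivers two FP16 values whose cost is split evenly. Both assumptions are for illustration and are not stated in the source.

```python
# Energy per operation in picojoules, copied from the table above.
ENERGY_PJ = {
    "8b Add": 0.03,
    "16b Add": 0.05,
    "32b Add": 0.1,
    "16b FP Add": 0.4,
    "32b FP Add": 0.9,
    "8b Mult": 0.2,
    "32b Mult": 3.1,
    "16b FP Mult": 1.1,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5.0,
    "32b DRAM Read": 640.0,
}


def energy_ratio(op_a: str, op_b: str) -> float:
    """How many times more energy op_a costs than op_b."""
    return ENERGY_PJ[op_a] / ENERGY_PJ[op_b]


# Assumption: an FP16 MAC is one 16b FP multiply plus one 16b FP add.
mac_pj = ENERGY_PJ["16b FP Mult"] + ENERGY_PJ["16b FP Add"]
# Assumption: one 32b DRAM read supplies two FP16 weights.
dram_pj_per_fp16 = ENERGY_PJ["32b DRAM Read"] / 2
dram_share = dram_pj_per_fp16 / (dram_pj_per_fp16 + mac_pj)

print(f"DRAM read vs 32b FP multiply: {energy_ratio('32b DRAM Read', '32b FP Mult'):.0f}x")
print(f"DRAM share of energy per FP16 MAC with one weight fetch: {dram_share:.1%}")
```

Under these assumptions, nearly all of the energy of a weight-streaming MAC is spent on the DRAM access rather than on the arithmetic.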

# **Near Memory Solution**

Each near memory solution has its role in the memory hierarchy

- High Bandwidth Memory (HBM) for HPC
- Low-Latency Wide-IO (LLW) for Mobile



# **Intelligent Memory and Types**

- Three distinct categories in this talk
  - CIM (computing-in-memory): uses the memory array itself as the processing unit
  - PIM (processing-in-memory): uses logic embedded near the memory array as the processing unit
  - PNM (processing-near-memory): uses an additional chip for processing inside a memory package or set



# **PIM: Renewed Interest**

ML workloads with growing model sizes require more frequent DRAM accesses, which limit performance and dominate energy consumption.



# Processing-in-Memory (PIM)

- Utilizes internal memory bandwidth through bank-level parallelism
  - Proposed by major DRAM vendors (Samsung and SK hynix)



Figure panels: normal DRAM vs. PIM-enabled DRAM

# **Overview of PIM Architecture**

- High on-chip compute bandwidth without changing DRAM core circuitry
  - Place SIMD FPU at bank IO boundary
  - Exploit bank-level parallelism: access multiple banks/FPUs in a lockstep manner
- Expose high on-chip bandwidth of standard DRAM to processors
  - Build on industry standard DRAM interfaces and preserve deterministic DRAM timing
  - i.e., a DRAM RD/WR command triggers execution of a PIM instruction
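The lockstep, bank-parallel execution described above can be illustrated with a small functional model. This is a sketch under stated assumptions, not the actual command protocol: the row-to-bank mapping, the default bank count, and the `pim_gemv` name are hypothetical, and accumulation is done in ordinary Python arithmetic rather than FP16.

```python
def pim_gemv(weights, x, num_banks=16, lanes=16):
    """Functional model of y = W @ x where one broadcast command drives all banks.

    Assumed layout: bank b holds rows [b * rows_per_bank, (b + 1) * rows_per_bank).
    Each command makes every bank read `lanes` weights from its own row
    and accumulate them against the same slice of x.
    """
    rows, cols = len(weights), len(x)
    if rows % num_banks or cols % lanes:
        raise ValueError("rows must divide by num_banks and cols by lanes")
    rows_per_bank = rows // num_banks
    y = [0.0] * rows
    commands = 0
    for r in range(rows_per_bank):
        for c in range(0, cols, lanes):
            commands += 1  # one column command, executed by all banks in lockstep
            for bank in range(num_banks):
                row = bank * rows_per_bank + r
                y[row] += sum(weights[row][c + i] * x[c + i] for i in range(lanes))
    return y, commands


# Small integer-valued example so the result can be checked exactly.
ROWS, COLS = 32, 32
W = [[(r * COLS + c) % 7 - 3 for c in range(COLS)] for r in range(ROWS)]
v = [c % 5 - 2 for c in range(COLS)]
y, commands = pim_gemv(W, v)
host_reads = ROWS * COLS // 16  # reads needed if each 16-element burst left the DRAM
print(f"PIM commands: {commands}, equivalent external reads: {host_reads}")
```

In this model the number of commands shrinks by the number of banks, which mirrors how bank-level parallelism exposes more bandwidth than the external interface can carry.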



# **PIM-DRAM:** Microarchitecture

- Consists of three major components with a DRAM local bus interface:
  - A 16-lane FP16 SIMD FPU array: 16 FP16 multipliers paired with 16 FP16 adders
  - Register files: Command, General, and Scalar register files (CRF, GRF, and SRF)
  - A PIM unit controller (fetch and decode, controls pipeline signals, forward)
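To make the 16-lane FP16 datapath concrete, the following sketch emulates one lane-wise multiply-accumulate step in software. It assumes the multiplier and adder each round their result to FP16 (no fused multiply-add); that rounding behavior and the names `to_fp16` and `simd_mac` are assumptions for illustration, not details taken from the slide.

```python
import struct

LANES = 16


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def simd_mac(acc, a, b):
    """Lane-wise acc + a * b, rounding after the multiply and after the add."""
    if not (len(acc) == len(a) == len(b) == LANES):
        raise ValueError(f"all operands must have {LANES} lanes")
    return [to_fp16(to_fp16(x * y) + z) for x, y, z in zip(a, b, acc)]


# One step: a vector register times a broadcast scalar, added to an accumulator.
grf_acc = [0.0] * LANES
grf_weights = [to_fp16(0.25 * i) for i in range(LANES)]
srf_scalar = [to_fp16(2.0)] * LANES
grf_acc = simd_mac(grf_acc, grf_weights, srf_scalar)
print(grf_acc)
```

Rounding at each step is what limits accuracy for long accumulations: once an accumulator reaches 2048, adding 1.0 in FP16 no longer changes it.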



# **HBM-PIM Implementation**

- Based on a commercial HBM2 design
  - Off-chip and on-chip bandwidth: 1.23 TB/s and 4.92 TB/s
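The two bandwidth figures imply a 4x gap between what the banks can deliver internally and what crosses the external interface. A minimal sketch of what that means for a bandwidth-bound layer follows; `WEIGHT_BYTES` is a hypothetical model size chosen for illustration, and the times are ideal lower bounds that ignore compute, command overhead, and everything outside the memory.

```python
OFF_CHIP_BW = 1.23e12  # bytes/s, off-chip bandwidth from the slide
ON_CHIP_BW = 4.92e12   # bytes/s, on-chip bandwidth from the slide
WEIGHT_BYTES = 12e9    # hypothetical: about 6B parameters stored in FP16


def min_stream_time(num_bytes: float, bandwidth: float) -> float:
    """Ideal time in seconds to read num_bytes once at the given bandwidth."""
    return num_bytes / bandwidth


ratio = ON_CHIP_BW / OFF_CHIP_BW
t_off = min_stream_time(WEIGHT_BYTES, OFF_CHIP_BW)
t_on = min_stream_time(WEIGHT_BYTES, ON_CHIP_BW)
print(f"on-chip / off-chip bandwidth: {ratio:.1f}x")
print(f"ideal weight pass: {t_off * 1e3:.2f} ms off-chip, {t_on * 1e3:.2f} ms on-chip")
```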



# **HBM-PIM Powered Systems**

Collaborated with two system-board companies



# HBM-PIM Evaluation – Performance/Power/Energy

- HBM-PIM improves energy efficiency through both shorter execution time and lower average power consumption.



# **HBM-PIM Cluster**

- Installed a cluster of 96 AMD MI100 GPUs equipped with HBM-PIM



# **Evaluation Results on MoE Model**

2x performance and 3x energy efficiency compared with a normal GPU
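The 2x and 3x figures together say something about average power. The sketch below assumes energy efficiency means work per joule for the same workload, so that the efficiency gain equals the speedup divided by the power ratio; that definition and the function name are assumptions, since the slide does not state how efficiency was measured.

```python
def implied_power_ratio(speedup: float, efficiency_gain: float) -> float:
    """Average power of the accelerated system relative to the baseline.

    With energy = power * time and the same work done in both cases:
        efficiency_gain = (P_base * T_base) / (P_new * T_new)
                        = speedup / (P_new / P_base)
    """
    if speedup <= 0 or efficiency_gain <= 0:
        raise ValueError("speedup and efficiency_gain must be positive")
    return speedup / efficiency_gain


power_ratio = implied_power_ratio(speedup=2.0, efficiency_gain=3.0)
print(f"implied average power: {power_ratio:.2f}x of the baseline")
```

Under this reading, the reported gains would come from both a shorter run time and a lower average power draw, rather than from speed alone.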



# PIM Value on GPT

- OpenAI's focus is on developing new AI technologies and pushing the boundaries of what is possible with AI, so it's possible that they will explore the use of PIM technology at some point in the future.
- AMD: access energy can be reduced by executing the main algorithm kernels directly in memory
  - Dr. Lisa Su: energy reduction of 85% compared with conventional HBM



# Case Study on Commercial DRAM Technology

- UPMEM
- SK hynix GDDR6
  - AiMX (PIM cluster) based on SK hynix GDDR6



Reference: The true Processing In Memory accelerator, Hot Chips '19





# On-Device AI & LPDDR-PIM

- Growing Importance of On-Device AI
  - Data center costs and power consumption are increasing
  - Privacy concerns are rising as sensitive data is transmitted to the cloud for processing
  - Network connectivity is not always reliable or available, particularly in remote areas
- LPDDR-PIM improves battery life and prevents memory over-provisioning just for bandwidth



# **LPDDR-PIM Introduction**

- LPDDR-PIM improves performance and energy efficiency of the system with in-DRAM processing
  - Performance: Utilizes up to 8x higher in-DRAM bandwidth through bank-parallel operation
  - Energy Efficiency: Reduces data movement energy by utilizing in-DRAM processing unit



# LPDDR-PIM System Perf./Power Analysis

- LPDDR-PIM improves energy efficiency through shorter execution time.
  - Power consumption of DRAM internal components (red) increases proportionally
  - Power consumption of the global I/O bus (light red) and I/O PHYs (light blue) decreases considerably









# Summary

- Generative AI requires **high-bandwidth** and **high-capacity** memories.
- Memory vendors provide various memory solutions to meet these requirements.
  - HBM for servers and LLW for edge and mobile
- Processing capability in memory enables higher bandwidth and energy efficiency.
  - CIM, PIM, and PNM are meaningful and heavily studied in academia and industry.
  - **PIM**: Many challenges remain to be solved before commercialization.
  - In particular, **LPDDR-PIM** is being prepared for on-device AI.
  - **Strong collaboration is needed** between system, processor, memory, and software.