

2022.04.02

SK Hynix, Euicheol Lim





## CONTENTS

- I • GDDR6-AiM
- II • AI Services
- III • Computational Memory
- IV • Summary

© SK hynix Inc. This material is proprietary of SK hynix Inc. and subject to change without notice. / Confidential

# **GDDR6-AiM Overview**

### Definition

GDDR6-AiM is a GDDR6-based Accelerator-in-Memory (AiM) device that accelerates inference for memory-intensive machine learning algorithms (RNN, LSTM, MLP) by offloading certain mathematical operations (MAC, Activation Function, Element-Wise Multiplication) from the host (CPU, GPU, FPGA).

### **Conventional System**



### **In-Memory Accelerated System**





## **GDDR6-AiM Overview**

#### **Global Buffer**

- Supplementary 2 kB SRAM buffer.
- Provides vector data for MAC.
- Supports 32B WRITE operations.



### **Activation Module**

- Performs Activation Function (AF) computation by linearly interpolating pre-stored AF template data using MAC calculation results.
- Activation results are stored in a dedicated AF\_REG set and can be later accessed by the user.
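The interpolation step described above can be sketched in Python. This is only an illustration: the template size, the input range, the clamping behavior, and the names `build_af_template` and `interpolate_af` are assumptions, not details of the device.

```python
import math


def build_af_template(fn, lo, hi, entries):
    """Pre-compute `entries` evenly spaced samples of `fn` on [lo, hi].

    Table size and range are hypothetical; the device's real template
    layout is not described in the source.
    """
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)], lo, step


def interpolate_af(template, lo, step, x):
    """Approximate the activation at x (for example a MAC result) by
    linear interpolation between the two nearest template entries.
    Inputs outside the table range are clamped (an assumption)."""
    pos = (x - lo) / step
    pos = min(max(pos, 0.0), float(len(template) - 1))
    i = min(int(pos), len(template) - 2)
    frac = pos - i
    return template[i] + frac * (template[i + 1] - template[i])


# Example: a 65-entry tanh template on [-4, 4].
tanh_template, tanh_lo, tanh_step = build_af_template(math.tanh, -4.0, 4.0, 65)
approx = interpolate_af(tanh_template, tanh_lo, tanh_step, 0.3)
```

A denser template lowers the interpolation error at the cost of more pre-stored data.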



| BK0                   | BK3        | BK4        | BK7        |
|-----------------------|------------|------------|------------|
| MAC                   | MAC        | MAC        | MAC        |
| Activation            | Activation | Activation | Activation |
| Activation            | Activation | Activation | Activation |
| MAC                   | MAC        | MAC        | MAC        |
| BK1                   | BK2        | BK5        | BK6        |
| GLOBAL<br>BUFFER PERI |            |            |            |
| BK8                   | BK11       | BK12       | BK15       |
| MAC                   | MAC        | MAC        | MAC        |
| Activation            | Activation | Activation | Activation |
| Activation            | Activation | Activation | Activation |
| MAC                   | MAC        | MAC        | MAC        |
| BK9                   | BK10       | BK13       | BK14       |

### Multiply-And-Accumulate (MAC)

- Performs MAC operation on **sixteen** bfloat16 weight and vector elements (corresponds to a single DRAM column access, i.e. 32 Bytes).
- Computation results are stored in a dedicated MAC\_REG set and can be later accessed by the user.
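As a rough software model of one MAC step, the sketch below rounds inputs to bfloat16 and accumulates sixteen products, matching the 16 × 2 B = 32 B column access. The accumulator precision (a Python float here) and the helper names `to_bf16` and `mac16` are assumptions; the source does not describe the internal datapath.

```python
import struct

BYTES_PER_BF16 = 2
ELEMENTS_PER_MAC = 16  # sixteen bf16 values = one 32-byte column access


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the upper 16 bits of an IEEE-754 float32.
    NaN and infinity handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def mac16(weights, vector, acc=0.0):
    """Add the dot product of sixteen weights and sixteen vector elements
    to `acc`. Accumulating in a Python float is an assumption."""
    assert len(weights) == len(vector) == ELEMENTS_PER_MAC
    for w, v in zip(weights, vector):
        acc += to_bf16(w) * to_bf16(v)
    return acc
```

Calling `mac16` repeatedly with the previous result as `acc` models accumulation over a longer weight row.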





## **GDDR6-AiM Operations**



- MAC and Activation Function operations can be performed in all banks in parallel.
- Weight data is sourced from the banks; vector data is sourced from the Global Buffer.
- MAC results are stored in latches collectively referred to as MAC\_REG.
- Activation Function results are stored in latches collectively referred to as AF\_REG.
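One way to picture this data flow is a matrix-vector product in which every bank holds different weight rows and all banks share the same vector. The sketch below is a functional model only: the row-to-bank assignment, the 2048-byte reading of "2 kB", and the function name `gemv_in_banks` are assumptions.

```python
NUM_BANKS = 16                  # BK0..BK15
GLOBAL_BUFFER_BYTES = 2 * 1024  # 2 kB SRAM, taken here as 2048 bytes
BYTES_PER_BF16 = 2


def gemv_in_banks(weight_rows, vector):
    """Compute y = W @ x, assigning one weight row to each bank per step.

    The row-to-bank mapping is a hypothetical choice for illustration.
    """
    # The vector must fit in the Global Buffer (1024 bf16 elements).
    assert len(vector) * BYTES_PER_BF16 <= GLOBAL_BUFFER_BYTES
    global_buffer = list(vector)  # written once, shared by all banks
    result = []
    for base in range(0, len(weight_rows), NUM_BANKS):
        group = weight_rows[base:base + NUM_BANKS]
        # Each bank multiplies its own row by the shared vector.
        # In hardware these MACs would run in parallel across banks.
        mac_reg = [sum(w * x for w, x in zip(row, global_buffer))
                   for row in group]
        result.extend(mac_reg)    # the host later reads MAC_REG back
    return result
```

For example, `gemv_in_banks([[1, 2], [3, 4]], [1, 1])` uses two of the sixteen banks and yields one MAC result per row.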

#### In-Channel COPY




- Global Buffer acts as a FIFO register.
- Read Copy fills the FIFO; Write Copy transfers the FIFO contents to a bank.
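The copy flow can be modeled with a simple queue. The class below is a hypothetical illustration: the 32-byte chunk size, the capacity check, and the method names are assumptions and do not reflect the actual command set.

```python
from collections import deque

CHUNK_BYTES = 32            # assumed transfer granularity
FIFO_CAPACITY_BYTES = 2048  # 2 kB Global Buffer, taken as 2048 bytes


class GlobalBufferFifo:
    """Toy model of the Global Buffer used as a FIFO for in-channel copy."""

    def __init__(self):
        self._chunks = deque()

    def read_copy(self, bank, column):
        """Read one chunk from a bank column and push it into the FIFO."""
        if (len(self._chunks) + 1) * CHUNK_BYTES > FIFO_CAPACITY_BYTES:
            raise OverflowError("Global Buffer FIFO is full")
        self._chunks.append(bytes(bank[column]))

    def write_copy(self, bank, column):
        """Pop the oldest chunk from the FIFO and store it in a bank column."""
        bank[column] = self._chunks.popleft()


# Copy two columns from one bank to another; banks are modeled as dicts.
source_bank = {0: b"a" * CHUNK_BYTES, 1: b"b" * CHUNK_BYTES}
target_bank = {}
fifo = GlobalBufferFifo()
fifo.read_copy(source_bank, 0)
fifo.read_copy(source_bank, 1)
fifo.write_copy(target_bank, 5)
fifo.write_copy(target_bank, 6)
```

Because the buffer is first-in, first-out, chunks arrive at the destination in the order they were read.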

#### **Element-Wise Multiplication**



- Underlying expression: c[i] = a[i] · b[i]
- One operation per **Bank Group** can be performed in parallel.



## II • AI Services

### **AI Service Business Case - Grab Driver**

- Data, analytics, and AI platforms are being adopted to meet growing needs for real-time, large-scale, AI-based analysis.
- [1] AI modeling enables prediction; [2] DB/analytics stores and analyzes key business data.




## **On-premise vs. Cloud (IaaS, PaaS, SaaS)**

- The center of gravity of enterprise IT has shifted from on-premise to cloud (IaaS, PaaS, SaaS) since 2010.
- Cloud services are improving user convenience and strengthening their AI/ML and analytics PaaS offerings.

### IaaS, PaaS & SaaS

- IaaS: HW resource provision → platform built by Service User
- **PaaS:** + Platform provision → Application development by Service User
- SaaS: + Application provision → Service built by Service User





### AI/ML, Analytics PaaS based on Data lake (AWS)

[Figure: AWS data lake with analytics and AI/ML services: Amazon Aurora, Lake Formation, EMR, DynamoDB, Athena, S3, AWS Glue, Elasticsearch Service, SageMaker, Redshift]

- [CSP] Opportunities for service/infrastructure optimization (cost reduction)
- [Service User] IT Cost/Business Reduction
- CSPs understand domain-specific customer requirements and workloads



### AI Model → Embedding + DL Neural Network

### • Embedding

- Translates real-space data (language, image, item) into deep-learning-space data (vectors).
- Memory-intensive function.

### • DL NeuralNet (Transformer/MLP...)

- Performs neural network computation in the deep learning space.
- Memory-intensive and compute-intensive function.

[Figure: Real Space (language, image, item) → Embedding → vectors in Deep Learning Space → DL NeuralNet (AI engine in DL space) → answer, with forward and backward paths]

- Embedding-intensive AI service: Recommendation
- DL NN-intensive AI service: NLP, Vision, ...



### Data pipeline for AI service

- Real-time data analytics and AI systems are built as data processing pipelines that use AI.
- In the future, architectural convergence between AI and analytics is expected.





## III • Computational Memory

## **Back to the Basics – Computational Memory**

### • Computational Memory Concept

- Performing some host operations on the memory side improves energy efficiency and performance.



## **Check point: CXL - Heterogeneous Interconnect**

- The CXL interconnect, which connects heterogeneous computing elements, will become the center of the server system.
  - CXL is an interconnect that supports high-speed connections between the host processor and accelerators/memory devices.
  - CXL supports 3 protocols based on PCIe Gen 5.0:
    - 1) CXL.io, 2) CXL.cache, 3) CXL.mem
- Opportunity: value-added memory solutions become possible.
  - Unlike conventional DIMMs, the CXL memory protocol supports handshaking communication, which allows additional functions on the memory side (e.g., DRAM cache, data processing engine...).
- With the advent of a memory-intensive killer application (AI) and a memory-semantic interconnect (CXL), research on and deployment of CXL memory-based Computational Memory is expected to accelerate.




## Card level CM - CMS (CXL-based Computational Memory Solution)

- Higher performance by fully utilizing memory BW + energy saving through data reduction + low-cost high capacity
  - The computing core can handle data-intensive workloads efficiently by fully utilizing the memory bandwidth inside the card.
  - Data reduction inside the card can significantly reduce the energy consumed by data movement.
  - Cost-effective scalability lets the system scale up and out easily, without paying for expensive servers just to increase the number of memory channels.



### Various CM solution (Characteristics & Challenges)



| CM solution | Characteristics |
|-------------|-----------------|
| Die level CM - PIM | <ul> <li>Bank level parallelism</li> <li>Applied within one memory device die</li> <li>Small memory capacity per processing node</li> <li>Need to define Host interface and standardization</li> </ul> |
| Card level CM - CMS | <ul> <li>Channel level parallelism</li> <li>Applied across multiple memory devices</li> <li>Larger memory capacity per processing node than PIM</li> <li>CXL interface available</li> </ul> |
| Storage level CM - CSD | <ul> <li>NAND flash level parallelism</li> <li>Applied across multiple NAND flash devices</li> <li>Larger memory capacity per processing node than CMS</li> <li>Block interface → KV interface</li> </ul> |

Challenges:

- Workload analysis for actual use cases
- SW framework support such as compiler, API, library, and device driver

### **Roofline Analysis**

- A methodology for analyzing which HW architecture suits a specific processing algorithm
  - $W$ (Work): number of operations performed by a given application
  - $Q$ (Memory Traffic): number of bytes of memory transfers incurred during execution of the application
  - $\mathrm{OI}$ (Operational Intensity) $= W/Q$: number of operations per byte of memory traffic
  - $P$ (Attainable performance) $= \min(\pi, \beta \times \mathrm{OI})$: for the given HW, $\pi$ is the max processing performance and $\beta$ is the max bandwidth
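The roofline bound translates directly into a few lines of Python. The hardware numbers in the example are hypothetical and chosen only to show the two regimes; the function names are illustrative.

```python
def operational_intensity(work_ops, traffic_bytes):
    """OI = W / Q, in operations per byte of memory traffic."""
    return work_ops / traffic_bytes


def attainable_performance(peak_ops_per_s, peak_bytes_per_s, oi):
    """P = min(pi, beta * OI) for the given hardware."""
    return min(peak_ops_per_s, peak_bytes_per_s * oi)


def ridge_point(peak_ops_per_s, peak_bytes_per_s):
    """OI at which a kernel stops being memory bound (pi / beta)."""
    return peak_ops_per_s / peak_bytes_per_s


# Hypothetical machine: pi = 10 TOPS, beta = 500 GB/s.
PI = 10e12
BETA = 500e9
memory_bound = attainable_performance(PI, BETA, 0.25)    # limited by beta * OI
compute_bound = attainable_performance(PI, BETA, 100.0)  # limited by pi
```

Kernels whose OI lies below `ridge_point(PI, BETA)` are limited by bandwidth, which is the case where raising the effective $\beta$ helps most.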




### Workload Analysis – Data Analytics, Embedding

- Representative data analytics functions have low operational intensity overall → memory bound.
- The embedding operation is also memory bound, with very low operational intensity.
- Running these operations in Computational Memory can therefore yield performance and power gains.
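To see why embedding is memory bound, consider a sum-pooled lookup that gathers several embedding rows and adds them. The sketch below estimates its operational intensity; the pooling scheme, the 4-byte element size, and the decision to ignore index traffic are assumptions made for illustration.

```python
def embedding_bag_oi(num_lookups, dim, bytes_per_element=4):
    """Estimate OI (ops per byte) of a sum-pooled embedding lookup.

    Work:    (num_lookups - 1) * dim additions.
    Traffic: num_lookups rows read plus one pooled row written.
    Index traffic is ignored (an assumption).
    """
    work = (num_lookups - 1) * dim
    traffic = (num_lookups + 1) * dim * bytes_per_element
    return work / traffic


# Three lookups of 64-dimensional fp32 vectors.
oi_small = embedding_bag_oi(3, 64)
# Even with many lookups the OI stays below 1 / bytes_per_element.
oi_large = embedding_bag_oi(1000, 64)
```

Under these assumptions the OI can never reach 0.25 operations per byte for 4-byte elements, no matter how many rows are pooled.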





## Workload Analysis – DL Neural network

- Matrix multiplication, which has been the main target of PIM, is losing its memory-intensive character as batch sizes increase and algorithms evolve.
- There is still an opportunity to offload functions that remain memory intensive regardless of batch size, such as layer normalization and other functions that operate on the data itself.

- GEMV (related to weights): small batch → memory intensive, large batch → compute intensive
- Normalization, Optimizer (related to data)
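The batch-size effect can be illustrated by estimating the operational intensity of a weight multiplication $Y = WX$ with an $m \times k$ weight matrix and a batch of $b$ input vectors. The traffic model below (each operand moved exactly once, 2-byte elements) is a simplifying assumption.

```python
def matmul_oi(m, k, batch, bytes_per_element=2):
    """Estimate OI (ops per byte) of Y = W @ X.

    W is m x k, X is k x batch, Y is m x batch.
    Assumes every operand crosses the memory interface exactly once.
    """
    work = 2 * m * k * batch  # one multiply and one add per term
    traffic = (m * k + k * batch + m * batch) * bytes_per_element
    return work / traffic


# batch = 1 is GEMV: roughly one operation per byte of weight traffic.
oi_gemv = matmul_oi(1024, 1024, 1)
# A larger batch reuses each weight many times, so OI grows.
oi_batched = matmul_oi(1024, 1024, 256)
```

The weight traffic is paid once regardless of batch size, so the same matrix moves from the memory-bound side of the roofline toward the compute-bound side as the batch grows.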





- In Transformers, memory-intensive operations take a significant portion of the total execution time.
- The Attention layer also contains memory-intensive operations (softmax, biases, dropout).





## Workload Analysis – Workload density

- Workload density: how dense the data to be processed is in the memory.
- If the workload is sparse compared to the memory capacity per PE (processing element), data reduction per PE is reduced and frequent data movement between PEs is induced.

- The higher the workload density, the more appropriate the workload is for PIM.

[Figure: workload density spectrum, labeled from "High workload density" to "Low workload density": Deep learning NN (MLP, CNN, Transformer...); Data Analytics; sparse data with large data movement among processing nodes]

[Figure: speedup over CPU (log scale) for CPU, GPU, 640 DPUs, and 2556 DPUs on benchmarks including HST-S, HST-L, SCAN-RSS, SCAN-SSA, SEL, NN, BS, RED, TRNS, and GEMV, grouped into (1) more PIM-suitable workloads and (2) less PIM-suitable workloads, with geometric means per group. Source: Gómez-Luna, Juan, et al. "Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture." (2021)]


### **Summary**

- Value-added memory-based solutions can be deployed across the whole data pipeline of AI/data analytics systems
  - Die level CM : PIM
  - Card level CM : CMS
  - Storage level CM : CSD



# End of Document