

## MLWeaving: A One-Size-Fits-All System for Any-precision Learning

ETH Zürich

Zeke Wang, Kaan Kara, Hantian Zhang, Gustavo Alonso, Onur Mutlu, Ce Zhang

**Problem:** Linear model training using low-precision Stochastic Gradient Descent (SGD), run over multiple epochs.

### **Existing Approach:**

1. One hardware design for each precision.
2. One quantized dataset for each precision.

### **Our Approach (MLWeaving):**

One hardware design and one memory layout for any precision.
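One way to picture "one memory layout for any precision" is a bit-plane layout: the most significant bits of all features are stored first, then the next bits, and so on, so that reading only the first `s` planes yields an `s`-bit view of the same data. The sketch below is an illustrative assumption rather than the layout as specified by the authors; `weave` and `read_precision` are hypothetical names, and the features are assumed to be unsigned fixed-point integers.

```python
def weave(values, bits=8):
    """Split unsigned fixed-point values into bit planes, MSB plane first.

    planes[k][j] is bit (bits - 1 - k) of values[j].
    """
    return [[(v >> (bits - 1 - k)) & 1 for v in values] for k in range(bits)]


def read_precision(planes, s, bits=8):
    """Rebuild the values from only the first s planes.

    Bits below the s most significant ones are read as zero, so one
    stored layout serves every precision from 1 to `bits`.
    """
    out = [0] * len(planes[0])
    for k in range(s):
        weight = 1 << (bits - 1 - k)
        for j, bit in enumerate(planes[k]):
            out[j] += bit * weight
    return out


features = [0b10110101, 0b01001110, 255, 0]
planes = weave(features, bits=8)
full = read_precision(planes, s=8, bits=8)   # all 8 bits of every feature
low = read_precision(planes, s=4, bits=8)    # only the top 4 bits
```

Under this assumption, a lower precision simply reads a shorter prefix of the stored planes, so no separate quantized copy of the dataset is needed.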

For i = 1 to N do /\*N samples $(\vec{a}_i, b_i)$\*/

- ax = $Q_s(\vec{a}_i) \cdot \vec{x}$; /\*dot product\*/
- scale = $\gamma \cdot df(ax, b_i)$; /\*serial part\*/
- $\vec{g}$ = scale $\cdot\ Q_s(\vec{a}_i)$; /\*gradient computation\*/
- $\vec{x} = \vec{x} - \vec{g}$; /\*model update\*/

Here $\vec{x}$ is the model, $\vec{g}$ the gradient, $\gamma$ the learning rate, and $df$ the derivative of the loss function.

### **MLWeaving: Software/Hardware Co-design**

#### MLWeaving memory layout (software)

1. Original full-precision fixed-point table for the training dataset a

#### MLWeaving arithmetic (hardware)
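The arithmetic of the per-sample SGD step (dot product, serial part, gradient computation, model update) can be sketched in software as follows. This is a minimal sketch under stated assumptions: the poster does not define $Q_s$, so `quantize` here simply keeps the `s` most significant bits of an 8-bit unsigned feature, and `df_least_squares` assumes a least-squares loss. Both helpers are hypothetical and are not the authors' hardware design.

```python
def quantize(a, s, bits=8):
    """Assumed Q_s: keep the s most significant bits, scale to [0, 1)."""
    mask = ((1 << s) - 1) << (bits - s)
    return [(v & mask) / float(1 << bits) for v in a]


def df_least_squares(ax, b):
    """Derivative of 0.5 * (ax - b)**2 with respect to ax (assumed loss)."""
    return ax - b


def sgd_epoch(samples, labels, x, gamma, s, df=df_least_squares):
    """One pass over the samples, following the loop on the poster."""
    for a, b in zip(samples, labels):
        qa = quantize(a, s)
        ax = sum(q * w for q, w in zip(qa, x))    # dot product
        scale = gamma * df(ax, b)                 # serial part
        g = [scale * q for q in qa]               # gradient computation
        x = [w - gi for w, gi in zip(x, g)]       # model update
    return x


samples = [[128, 64]]   # one sample with two 8-bit features
labels = [1.0]
x_full = sgd_epoch(samples, labels, [0.0, 0.0], gamma=1.0, s=8)
x_1bit = sgd_epoch(samples, labels, [0.0, 0.0], gamma=1.0, s=1)
```

The only step that depends on the previous one across samples is the scalar `scale`, which is why the poster labels it the serial part; the dot product and the gradient are element-wise over the features.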

### Experiment

*Figure axis label: speedup over "32-bit".*

- **Hardware**: an Intel Broadwell CPU (14 cores, 35 MB LLC, 60 GB/s memory bandwidth) and an Intel Arria 10 FPGA, which directly accesses the CPU memory via one QPI link and two PCIe links, with a memory bandwidth of 15 GB/s.
- **Dataset**: Epsilon (40,000 samples, 2000 features).
- **Hogwild (ModelAverage)**: state-of-the-art parallel implementations of SGD on CPUs, using 14 cores, AVX2, and an 8-bit dataset.

### Findings:

1. MLWeaving achieves roughly linear speedup (in time or memory traffic) as the number of bits decreases, as shown in Figures a and b.
2. MLWeaving on an FPGA achieves an 11X speedup over its CPU rivals, as shown in Figure c.
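The roughly linear relation between precision and memory traffic can be checked with a back-of-the-envelope estimate. The sketch below assumes that each epoch reads every feature exactly once at the chosen precision and that training is bound by memory bandwidth; it ignores labels, the model, and any layout overhead. The function names are hypothetical, and the numbers come from the setup listed above, not from measured results.

```python
def epoch_traffic_bytes(num_samples, num_features, bits):
    """Bytes of dataset read per epoch at the given precision."""
    return num_samples * num_features * bits // 8


def epoch_time_lower_bound(traffic_bytes, bandwidth_bytes_per_s):
    """Best-case epoch time if memory bandwidth is the only limit."""
    return traffic_bytes / bandwidth_bytes_per_s


N, D = 40_000, 2000            # Epsilon, as listed in the setup
FPGA_BANDWIDTH = 15e9          # 15 GB/s

baseline = epoch_traffic_bytes(N, D, 32)
for bits in (32, 16, 8, 4, 2, 1):
    traffic = epoch_traffic_bytes(N, D, bits)
    seconds = epoch_time_lower_bound(traffic, FPGA_BANDWIDTH)
    print(bits, traffic, baseline / traffic, seconds)
```

Under these assumptions, halving the number of bits halves the traffic, which matches the trend described in the first finding.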

