HW4 - Distributed Matrix Multiplication on WSE-2
Overview
This assignment is an introduction to programming on Cerebras’s Wafer-Scale Engine 2 (WSE-2) using the Cerebras Software Language (CSL).
In this assignment, you will implement distributed matrix multiplication $C = A \cdot B$ on a 2D grid of processing elements, where $A \in \mathbb{R}^{M \times H}$, $B \in \mathbb{R}^{H \times N}$, $C \in \mathbb{R}^{M \times N}$. All elements are fp32. The PE grid has kernel_x_dim columns and kernel_y_dim rows.
The GitHub repository for this assignment is here.
Algorithm
The algorithm proceeds in three phases:
Phase 1 – Column Broadcast
Each part of $B$ (a vector) is broadcast from a single PE in a column to all other PEs in that column. The broadcast is split into a sending-up route and a common broadcast-down route. Because two directions cannot share the same color, the wavelets must change color at the head PE. PEs take turns sending: the top PE (y=0) sends first, then y=1, and so on, controlled by a fabric switch sentinel that advances after each PE finishes transmitting.
Column x routing (kernel_y_dim = 3 example):

PE(x,0)  <-- color swap at head PE: broadcast_rx_color sent NORTH
  |          is reflected back SOUTH as broadcast_tx_color
  v
PE(x,1)  -- forwards SOUTH
  |
  v
PE(x,2)  -- terminal (no further forwarding)

After PE(x,0) finishes, a SWITCH_ADV control wavelet
shifts the source to PE(x,1), then PE(x,2), etc.
Each wavelet carries one f32 element. All kernel_y_dim PEs in the column transmit sequentially, so every PE receives the B-column data from all other PEs in the column.
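The turn-taking broadcast can be modeled on the host. The sketch below is plain Python, not CSL, and all names are illustrative: each PE in a column transmits its chunk in order (as SWITCH_ADV would advance the source in hardware), and every PE ends up holding the complete B-column.

```python
# Model of the Phase-1 column broadcast: PEs in one column take turns
# sending their local slice of B; after all turns, every PE holds the
# full column of B. Names here are illustrative, not from the starter code.

def column_broadcast(local_b_chunks):
    """local_b_chunks[y] is the B slice owned by PE(x, y) in this column."""
    kernel_y_dim = len(local_b_chunks)
    received = [[] for _ in range(kernel_y_dim)]
    # PEs send in order y = 0, 1, ..., advanced by SWITCH_ADV in hardware.
    for sender in range(kernel_y_dim):
        for wavelet in local_b_chunks[sender]:   # one f32 element per wavelet
            for y in range(kernel_y_dim):        # broadcast down the column
                received[y].append(wavelet)
    return received

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # kernel_y_dim = 3
out = column_broadcast(chunks)
assert all(row == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] for row in out)
```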
Phase 2 – Local SAXPY
Upon receiving each broadcast wavelet $b_h$ (a scalar from $B$), the PE executes:
\[r \mathrel{+}= A_j \cdot b_h\]
where $r$ is the local accumulation buffer red_in and $A_j$ is the corresponding column of the PE's local $A$ block, using the hardware fmac (fused multiply-accumulate) operation. Once the PE has received a full B-column, the partial sum is reduced along the row.
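As a sanity check of the accumulation, here is a NumPy model of the per-wavelet SAXPY. The buffer and index names mirror the description above but are otherwise illustrative:

```python
import numpy as np

# Phase-2 sketch: on each incoming scalar b_h, accumulate a SAXPY into the
# local buffer (called red_in in the assignment). A_local is the PE's block
# of A; the mapping of wavelet h to column index j is illustrative.

def on_wavelet(red_in, A_local, j, b_h):
    # fmac over a whole column: red_in += A_local[:, j] * b_h
    red_in += A_local[:, j] * b_h
    return red_in

A_local = np.arange(6, dtype=np.float32).reshape(3, 2)  # dM=3, dH=2
red_in = np.zeros(3, dtype=np.float32)
b = np.array([2.0, -1.0], dtype=np.float32)
for j, b_h in enumerate(b):
    on_wavelet(red_in, A_local, j, b_h)
assert np.allclose(red_in, A_local @ b)   # partial product of the local block
```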
Phase 3 – Row Reduction (Ring)
The partial vectors red_in of length $dM$ are reduced (summed) across the X-dimension using a ring. The ring physically sends data EAST and returns WEST via flyover, forming a cycle through all PEs.
Ring across X-dimension (kernel_x_dim = 4 example):
State rotation for successive reduce() calls:
Call # PE0 PE1 PE2 PE3 Chain (ISUM -> ... -> FSUM)
----- ------ ------ ------ ------ ---------------------------
0 FSUM ISUM PSUM PSUM PE1 -> PE2 -> PE3 -> PE0
1 PSUM FSUM ISUM PSUM PE2 -> PE3 -> PE0 -> PE1
2 PSUM PSUM FSUM ISUM PE3 -> PE0 -> PE1 -> PE2
3 ISUM PSUM PSUM FSUM PE0 -> PE1 -> PE2 -> PE3
ISUM: Initiator -- send local red_in to fabric (no receive)
PSUM: Partial -- receive from fabric, add local red_in, forward
FSUM: Final -- receive from fabric, add local red_in, write to C
Each PE maintains a state machine that cycles through ISUM, PSUM, and FSUM states. After kernel_x_dim reductions, each PE has written one column to its local $C$ block via the rotating FSUM assignment.
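The rotation in the table above can be checked with a small host-side model. This is a plain-Python sketch of the ring (names illustrative, not from the starter code); for reduce() call c, PE (c+1) mod kernel_x_dim initiates and PE c mod kernel_x_dim finalizes:

```python
import numpy as np

# Sketch of the Phase-3 rotating ring reduction. For reduce() call `call`,
# the initiator (ISUM) is PE (call+1) % P and the final accumulator (FSUM)
# is PE call % P, matching the state-rotation table. Pure Python, not CSL.

def ring_reduce(red_ins, call):
    P = len(red_ins)
    isum = (call + 1) % P              # ISUM: initiator
    acc = red_ins[isum].copy()         # initiator sends its local red_in
    pe = (isum + 1) % P
    while pe != isum:                  # PSUM hops around the ring to FSUM
        acc += red_ins[pe]
        pe = (pe + 1) % P
    return call % P, acc               # FSUM PE writes acc into its C block

red_ins = [np.full(2, float(x)) for x in range(4)]   # kernel_x_dim = 4
total = sum(range(4))
for call in range(4):
    fsum_pe, result = ring_reduce(red_ins, call)
    assert fsum_pe == call % 4                 # FSUM rotates each call
    assert np.allclose(result, total)          # every PE's red_in is summed
```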
Due Date: Thursday March 27th 2026 at 11:59 PM
Instructions
Teams
Note that you will work individually for this assignment.
Setting Up the SDK
This assignment uses the Cerebras SDK. We have placed a version of the SDK on Perlmutter, available at /global/cfs/cdirs/m4341/cs6230/homework26-sdk-1.4.0. To set up and install the SDK, first run:
student@perlmutter: /global/cfs/cdirs/m4341/cs6230/homework26-sdk-1.4.0/install.sh
then add the SDK to your path:
student@perlmutter: export PATH=$PATH:/global/cfs/cdirs/m4341/cs6230/homework26-sdk-1.4.0
verify this has worked by running:
student@perlmutter: cs_python
which should open a Python REPL if everything is working correctly.
What You Must Implement
You have three files to modify:
1. config.py — Data Layout Booleans
Set six boolean flags that control how matrices are distributed across PEs and stored in memory. The host-side infrastructure (prepare_h2d / reconstruct_d2h) in run.py imports these flags and uses them to chunk and reassemble the matrices automatically.
*_GLOBAL_TRANSPOSE: controls whether a matrix’s row axis maps to Y-PEs (False) or X-PEs (True).
*_MEMORY_TRANSPOSE: controls whether each PE stores its local block transposed in memory.
Think carefully about which matrix dimension must align with the reduction axis (X), and which memory layout gives contiguous access for column-vector operations.
2. layout.csl — Fabric Routing
Inside the layout { } block, implement:
- Tile code assignment: Use @set_tile_code to assign pe.csl to each PE with the appropriate parameters.
- Reduction ring routing: Configure @set_color_config for each PE so that partial sums flow from the last X-PE toward the first, with flyover colors for middle PEs.
- Broadcast routing: Configure @set_color_config so that each PE in a column can broadcast its B data to all other PEs in the same column.
3. pe.csl — On-PE Computation
Implement the per-PE logic:
- Broadcast receive: A data task that fires on each incoming wavelet, accumulates partial products via SAXPY (fmac), and triggers reduction when a full B-column has been received.
- Reduction state machine: A ring reduction across the X-dimension. PEs rotate through three roles (initiator, partial-sum, final-accumulator) across successive columns.
- Broadcast send: Send the local B data to the fabric, then send a control wavelet (SWITCH_ADV) to hand off to the next PE.
- Completion: Call sys_mod.unblock_cmd_stream() when all reductions finish.
Please do not modify any files besides config.py, layout.csl, and pe.csl.
Running the Code
# Single configuration
./run.sh <sizeX> <sizeY> <M> <H> <N>
# All test configurations
./test_configs.sh
# Debug mode (fabric traces)
./run.sh 2 2 4 4 4 --debug
# Check performance
./test_perf.sh
Constraints
- kernel_x_dim >= 2, kernel_y_dim >= 2
- $M$ mod kernel_y_dim $= 0$
- $H$ mod kernel_x_dim $= 0$
- $N$ mod kernel_x_dim $= 0$
- $N$ mod kernel_y_dim $= 0$
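A quick host-side way to check a configuration against these constraints before launching a run (helper name is illustrative, not part of the starter code):

```python
# Validate the size constraints from the assignment before launching.
# The helper name is illustrative; nothing in the repo provides it.

def check_config(kernel_x_dim, kernel_y_dim, M, H, N):
    assert kernel_x_dim >= 2 and kernel_y_dim >= 2
    assert M % kernel_y_dim == 0
    assert H % kernel_x_dim == 0
    assert N % kernel_x_dim == 0
    assert N % kernel_y_dim == 0

check_config(2, 2, 4, 4, 4)   # the smallest configuration used in debug mode
```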
Execution Flow
Host Device (all PEs)
---- ----------------
memcpy A, B to PEs
|
v
launch broadcast_pe() ---> initialize_reduce_states based on tile position
broadcast B
|
+-- recv wavelet --> SAXPY
+-- after enough wavelets --> reduce()
+-- after all N columns --> terminate_kernel()
|
v
memcpy C from PEs
verify C == A * B
Total reductions per PE: kernel_y_dim $\times$ dN_y $= N$, one per global column of $C$.
Grading
IMPORTANT: There is a planned Perlmutter monthly maintenance between March 18th and March 19th. Please factor this into your time management for this assignment.
Your score out of 100 points will be broken down into the following categories:
- 20/100 points: Writeup
- 10/100 points: Checkpoint
- 20/100 points: Leaderboard Submission
- 50/100 points: Runtime Performance
Your score on the “Runtime Performance” part is determined by the Runtime value emitted by the test_perf.sh script provided in the repo. Similar to HW3, this part will be graded on a curve determined by the performance of everyone in the class.
As with HW3, there is a leaderboard for this homework that you can use to determine where you might land on the curve. The leaderboard is available here.
Additionally, if your implementation attains a runtime below a certain threshold, we will guarantee at least a B+ on the 50 Runtime Performance points. The threshold for at least a B+ is 428000 cycles as reported by test_perf.sh.
Checkpoint
The checkpoint for this assignment requires you to pass 5/20 of the correctness tests in test_configs.sh. You can pick any 5 you’d like.
The checkpoint is due on March 20th, 2026 at 11:59 PM ET.
Leaderboard Submission
Before the deadline on March 27th, you must submit once to the leaderboard. Before submitting to the leaderboard, you must pass all the correctness tests in test_configs.sh. The leaderboard will check to make sure you pass all tests before allowing your submission.
Leaderboard submission follows the same basic process as in HW3. There is a job-leaderboard script that you must run. You can submit an output file containing the output of job-leaderboard to the actual leaderboard using submit.py <name-of-output-file>. For the leaderboard Canvas assignment, you should submit a .gz file containing the output of job-leaderboard.
Submission Details
- Make sure you have our most updated source code on Perlmutter.
- Make sure you have only modified the files layout.csl, pe.csl, and config.py.
- Ensure that your write-up pdf is located in your source directory. It should be named CS5220Group_hw4.pdf.
- From your build directory, run:
student@perlmutter:~/hw4> ./package.sh <GROUPID>
- Download and submit your .tar.gz through Canvas.
Writeup Details
Your write-up should contain:
- your name, Cornell ID (NetID), and Perlmutter username,
- a description of your implementation approach,
- a table containing the statistics emitted by test_perf.sh evaluating the efficiency of your implementation,
- a GenAI and collaboration disclosure, similar to the other homeworks.
Unlike the other homeworks, this writeup should be one page, two-column, using the ACM conference LaTeX template.
Tips
Debugging
Simprint (<simprint>)
The <simprint> library lets you print values directly to the simulator log (sim.log) during execution. A helper function simprint_pe_coords() is provided in pe.csl. Call it before any print to identify the PE:
simprint_pe_coords();
prt.print_string("hello from this PE\n");
simprint_pe_coords();
prt.fmt_no_nl("current_row={d}\n", .{@as(u16, current_row)});
simprint_pe_coords();
prt.fmt_no_nl("reduce_count={d}, C[0]={f}\n", .{reduce_count, C[0]});
Important: Output is flushed to sim.log only when "\n" is encountered. Always end your format strings with \n.
Reading sim.log
Each line in sim.log is prefixed with the cycle number and the absolute fabric coordinates of the PE:
@968 P4.1: Loc(0, 0) P4.1: sender beginning main_fn
@1156 P5.1: Loc(1, 0) P5.1: recv_task: in_data = 0, global = 0
@1888 P6.2: Loc(2, 1) P6.2: recv_task: in_data = 4, global = 20
The absolute coordinates differ from the logical ones by the fabric offsets (typically off_x=4, off_y=1), so Loc(0,0) corresponds to P4.1.
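A tiny helper for translating the absolute coordinates printed in sim.log back to logical PE coordinates (the offsets below are the typical values quoted above; confirm them against your own sim.log):

```python
# Convert the absolute fabric coordinates in sim.log (e.g. P4.1) back to
# logical PE coordinates. off_x=4, off_y=1 are the typical offsets noted
# in the assignment text; verify them for your own runs.

OFF_X, OFF_Y = 4, 1

def logical_coords(px, py):
    return px - OFF_X, py - OFF_Y

assert logical_coords(4, 1) == (0, 0)   # P4.1 is logical PE(0, 0)
assert logical_coords(6, 2) == (2, 1)   # P6.2 is logical PE(2, 1)
```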
Extracting Per-PE Logs
Use extract_logs.sh to split sim.log into per-PE files:
./extract_logs.sh <sizeX> <sizeY>
Debug Mode (Fabric Traces)
Run with --debug to enable fabric landing/router traces in sim.log:
./run.sh 2 2 4 4 4 --debug
File Structure
| File | You Edit? | Description |
|---|---|---|
| pe.csl | Yes | Per-PE kernel written in CSL. Core computation logic. |
| layout.csl | Yes | Fabric topology and routing configuration. |
| config.py | Yes | Six boolean flags controlling matrix distribution and memory layout. |
| run.py | No | Host-side Python driver. Generates matrices, distributes them, launches kernel, verifies correctness. Reset by the autograder. |
| run.sh | No | Compile-and-run wrapper. |
| test_configs.sh | No | Automated test sweep over ~30 configurations. |
| test_perf.sh | No | Performance benchmark. |
| clean.sh | No | Removes all build artifacts and logs. |
| extract_logs.sh | No | Splits sim.log into per-PE log files. |
| package.sh | No | Creates tar.gz for submission to Canvas. |
Recommended Tutorials
Study these examples before starting:
- topic-06-switches — fabric switches and SWITCH_ADV
- topic-14-color-swap — color-swap routing
- gemv-06-routes-1 through gemv-08-routes-3 — fabric routing patterns
- topic-11-collectives — broadcast and reduce patterns
- sdklayout-02-routing — @set_color_config API