<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta name="description" content="Interactive demo of Attention Residuals — replacing fixed residual connections with learned softmax attention over depth. Built with Rust + WASM." />
  <meta name="theme-color" content="#2563eb" media="(prefers-color-scheme: light)" />
  <meta name="theme-color" content="#60a5fa" media="(prefers-color-scheme: dark)" />
  <title>Attention Residuals — Interactive Demo</title>
  <link rel="preconnect" href="https://fonts.googleapis.com" />
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
  <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&family=JetBrains+Mono:wght@400;500&family=Source+Serif+4:ital,wght@0,400;0,600;0,700;1,400&display=swap" rel="stylesheet" />
  <link rel="stylesheet" href="/src/style.css" />
</head>
<body>
  <!-- Skip to content for keyboard users -->
  <a href="#demo" class="skip-link">Skip to interactive demo</a>

  <!-- ─── Navigation ──────────────────────────────────────────── -->
  <nav class="nav" role="navigation" aria-label="Main navigation">
    <div class="nav-inner">
      <a href="#top" class="nav-logo" aria-label="AttnRes — back to top">
        <span class="nav-logo-symbol" aria-hidden="true">α</span>
        <span>AttnRes</span>
      </a>
      <button class="nav-toggle" aria-expanded="false" aria-controls="nav-links" aria-label="Toggle navigation menu">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" aria-hidden="true">
          <line x1="3" y1="6" x2="21" y2="6" />
          <line x1="3" y1="12" x2="21" y2="12" />
          <line x1="3" y1="18" x2="21" y2="18" />
        </svg>
      </button>
      <div class="nav-links" id="nav-links" role="list">
        <a href="#problem" role="listitem">Problem</a>
        <a href="#algorithm" role="listitem">Algorithm</a>
        <a href="#demo" role="listitem">Live Demo</a>
        <a href="#training" role="listitem">Training</a>
        <a href="#comparison" role="listitem">Comparison</a>
        <a href="https://github.com/AbdelStark/attnres-rs" target="_blank" rel="noopener" role="listitem">GitHub</a>
      </div>
    </div>
  </nav>
        Paper: <em>Attention as a Hypernetwork</em> (MoonshotAI / Kimi) ·
        Implementation: <code>attnres-rs</code> (burn framework)
      </p>
      <div class="hero-status" id="wasm-status" role="status" aria-live="polite">
        <span class="status-dot loading" aria-hidden="true"></span>
        <span>Loading WASM engine…</span>
      </div>
    </div>
      <p>
        In standard Transformers, the residual connection is a simple addition:
      </p>
      <div class="equation" role="math" aria-label="h sub l plus 1 equals h sub l plus F sub l of h sub l">
        h<sub>l+1</sub> = h<sub>l</sub> + F<sub>l</sub>(h<sub>l</sub>)
      </div>
      <p>
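To make the "every layer contributes equally" point concrete, here is a scalar toy in Python (a sketch of the idea only, not the attnres-rs code; `standard_residual_stack` and the lambda layers are hypothetical names):

```python
def standard_residual_stack(h0, layers):
    """Unroll h_{l+1} = h_l + F_l(h_l): the final state equals h_0 plus
    every layer output, each entering with the same fixed weight of 1."""
    h, contributions = h0, [h0]
    for f in layers:
        out = f(h)                # F_l(h_l)
        contributions.append(out)
        h = h + out               # fixed +1 residual weight
    return h, contributions

# Two toy "layers"; the unrolled sum matches the recursive update exactly.
h, contribs = standard_residual_stack(1.0, [lambda x: 0.5 * x, lambda x: 0.1 * x])
# h == 1.0 + 0.5 + 0.15 == sum(contribs)
```

No layer's contribution can be down-weighted or dropped, which is exactly the limitation the diagram below illustrates.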
        <div class="col-viz">
          <div class="diagram" id="standard-residual-diagram">
            <div class="diagram-title">Standard Residual</div>
            <canvas id="canvas-standard" width="320" height="400" aria-label="Diagram showing standard residual connections with equal +1 weights between layers"></canvas>
            <div class="diagram-caption">
              All layers contribute equally (weight = 1).
              <br />No selectivity over depth.
      <div class="section-label">02</div>
      <h2>Attention Residuals: The Algorithm</h2>

      <div class="algo-steps" role="list">
        <div class="algo-step" role="listitem">
          <div class="algo-step-num" aria-hidden="true">1</div>
          <div class="algo-step-content">
            <h3>Stack block representations</h3>
            <p>
              Collect all completed block sums <strong>b<sub>0</sub>, …, b<sub>N-1</sub></strong>
              plus the current partial block into a value matrix.
            </p>
            <div class="equation" role="math">
              V = [b<sub>0</sub>; b<sub>1</sub>; …; b<sub>N</sub><sup>(partial)</sup>] ∈ ℝ<sup>(N+1) × D</sup>
            </div>
          </div>
        </div>

        <div class="algo-step" role="listitem">
          <div class="algo-step-num" aria-hidden="true">2</div>
          <div class="algo-step-content">
            <h3>Normalize keys with RMSNorm</h3>
            <p>
              Prevent large-magnitude blocks from dominating the attention logits.
              Without this, deeper blocks (which accumulate more layer outputs)
              would receive disproportionate weight.
            </p>
            <div class="equation" role="math">
              K = RMSNorm(V) = (V / √mean(V²)) · γ
            </div>
          </div>
        </div>
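Step 2 in isolation, as a dependency-free Python sketch (the `rms_norm` helper is illustrative; attnres-rs uses burn's tensor ops). After normalization, a large-magnitude block and a small one yield keys on the same scale, so neither dominates the dot-product logits:

```python
import math

def rms_norm(v, gamma=1.0, eps=1e-6):
    """Row-wise RMSNorm: v / sqrt(mean(v^2) + eps), scaled by a learned gamma."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [gamma * x / rms for x in v]

big   = rms_norm([8.0, -8.0, 8.0, -8.0])   # a deep block with large entries
small = rms_norm([0.5, -0.5, 0.5, -0.5])   # a shallow block with small entries
# big and small are nearly identical after normalization
```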

        <div class="algo-step" role="listitem">
          <div class="algo-step-num" aria-hidden="true">3</div>
          <div class="algo-step-content">
            <h3>Compute depth attention logits</h3>
            <p>
              A learned pseudo-query <strong>w<sub>l</sub></strong> ∈ ℝ<sup>D</sup>
              scores each block. Crucially, w<sub>l</sub> is <strong>initialized to zero</strong>,
              so the model starts from a uniform, standard-residual-like mixture and
              smoothly transitions to selective routing as w<sub>l</sub> grows.
            </p>
            <div class="equation" role="math">
              logits<sub>i</sub> = K<sub>i</sub> · w<sub>l</sub>   ∀ i ∈ {0, …, N}
            </div>
          </div>
        </div>

        <div class="algo-step" role="listitem">
          <div class="algo-step-num" aria-hidden="true">4</div>
          <div class="algo-step-content">
            <h3>Softmax over <em>depth</em></h3>
            <p>
              The softmax is taken <strong>over the block/depth dimension</strong>, not the
              sequence dimension. This is attention over <em>layers</em>, not over <em>tokens</em>.
            </p>
            <div class="equation" role="math">
              α<sub>i</sub> = softmax(logits)<sub>i</sub> = exp(logits<sub>i</sub>) / ∑<sub>j</sub> exp(logits<sub>j</sub>)
            </div>
          </div>
        </div>

        <div class="algo-step" role="listitem">
          <div class="algo-step-num" aria-hidden="true">5</div>
          <div class="algo-step-content">
            <h3>Weighted combination</h3>
            <p>
              The output is a learned convex combination of all block representations.
              Each layer can choose exactly how much information to draw from each depth.
            </p>
            <div class="equation" role="math">
              h = ∑<sub>i</sub> α<sub>i</sub> · V<sub>i</sub>
            </div>
          </div>
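Putting the five steps together in plain Python (a minimal single-vector sketch; the function and variable names are my own, and the real attnres-rs implementation operates on batched per-token tensors in burn):

```python
import math

def attn_res(blocks, w):
    """One attention-residual combination. `blocks` holds the N+1 block
    vectors b_0 .. b_N (step 1); `w` is this sublayer's pseudo-query."""
    def rms_norm(v, eps=1e-6):                      # step 2: normalize keys
        rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
        return [x / rms for x in v]

    keys = [rms_norm(b) for b in blocks]
    logits = [sum(k * q for k, q in zip(key, w)) for key in keys]  # step 3
    m = max(logits)                                 # step 4: stable softmax
    exps = [math.exp(x - m) for x in logits]
    alphas = [e / sum(exps) for e in exps]
    dim = len(blocks[0])                            # step 5: convex combination
    h = [sum(a * b[j] for a, b in zip(alphas, blocks)) for j in range(dim)]
    return h, alphas

# Zero-initialized pseudo-query: uniform weights, output is the block mean.
h, alphas = attn_res([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], w=[0.0, 0.0])
```

With `w = 0` every logit is zero, so the weights are uniform; training moves `w` away from zero and breaks that symmetry, which is exactly what the interactive demo below visualizes.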
        <div class="demo-panel">
          <div class="demo-controls">
            <div class="control-group">
              <label id="config-label">Model Configuration</label>
              <div class="control-row" role="group" aria-labelledby="config-label">
                <div class="control">
                  <label class="control-label" for="cfg-d-model">d_model</label>
                  <select id="cfg-d-model">
                    <option value="16">16</option>
                    <option value="32" selected>32</option>
                    <option value="64">64</option>
                  </select>
                </div>
                <div class="control">
                  <label class="control-label" for="cfg-layers">Layers (sublayers)</label>
                  <select id="cfg-layers">
                    <option value="4">4</option>
                    <option value="8" selected>8</option>
                  </select>
                </div>
                <div class="control">
                  <label class="control-label" for="cfg-blocks">Blocks</label>
                  <select id="cfg-blocks">
                    <option value="2" selected>2</option>
                    <option value="4">4</option>
                  </select>
                </div>
                <div class="control">
                  <label class="control-label" for="cfg-heads">Heads</label>
                  <select id="cfg-heads">
                    <option value="2">2</option>
                    <option value="4" selected>4</option>
            </div>

            <div class="control-group" id="query-controls" style="display:none">
              <label for="query-magnitude">Pseudo-Query Magnitude</label>
              <p class="control-hint">
                Drag the slider to simulate w<sub>l</sub> evolving away from zero during training.
              </p>
              <input type="range" id="query-magnitude" min="0" max="100" value="0" class="slider"
                     aria-valuemin="0" aria-valuemax="1" aria-valuenow="0" aria-valuetext="0.00 (uniform)" />
              <div class="slider-labels" aria-hidden="true">
                <span>0.0 (uniform)</span>
                <span id="query-mag-display">0.00</span>
                <span>1.0 (selective)</span>
          <div class="result-card result-card-wide">
            <div class="result-card-header">Depth Attention Weights</div>
            <div class="result-card-body">
              <canvas id="canvas-heatmap" width="800" height="300" aria-label="Heatmap showing depth attention weights across sublayers and source blocks"></canvas>
            </div>
            <div class="result-card-footer">
              Rows: sublayers (Attn/MLP at each transformer layer). Columns: source blocks.
          <div class="result-card">
            <div class="result-card-header">Attention Distribution</div>
            <div class="result-card-body">
              <canvas id="canvas-bar" width="400" height="250" aria-label="Bar chart of attention weight distribution for the deepest sublayer"></canvas>
            </div>
            <div class="result-card-footer">
              At zero init, all sources receive weight 1/N (uniform). Training breaks this symmetry.

        <div class="training-panel">
          <div class="training-controls">
            <button class="btn btn-primary" id="btn-train-start" disabled aria-label="Start training simulation">Start Training</button>
            <button class="btn" id="btn-train-reset" disabled aria-label="Reset training to initial state">Reset</button>
            <div class="training-stats" role="group" aria-label="Training statistics">
              <div class="stat">
                <span class="stat-label">Step</span>
                <span class="stat-value" id="train-step" aria-live="off">0</span>
              </div>
              <div class="stat">
                <span class="stat-label">Loss</span>
                <span class="stat-value" id="train-loss" aria-live="off">—</span>
              </div>
            </div>
          </div>
          <div class="result-card result-card-wide">
            <div class="result-card-header">Loss Curve</div>
            <div class="result-card-body">
              <div class="canvas-empty-state" id="loss-empty">Initialize a model and start training to see the loss curve</div>
              <canvas id="canvas-loss" width="800" height="200" style="display:none" aria-label="Training loss curve over steps"></canvas>
            </div>
          </div>
          <div class="result-card result-card-wide">
            <div class="result-card-header">Depth Attention Heatmap (evolving)</div>
            <div class="result-card-body">
              <div class="canvas-empty-state" id="heatmap-empty">Depth attention patterns will appear here during training</div>
              <canvas id="canvas-train-heatmap" width="800" height="300" style="display:none" aria-label="Evolving depth attention heatmap during training"></canvas>
            </div>
            <div class="result-card-footer">
              Watch how later layers develop stronger selectivity over depth.
          <div class="result-card result-card-wide">
            <div class="result-card-header">Pseudo-Query Norms ||w<sub>l</sub>||</div>
            <div class="result-card-body">
              <div class="canvas-empty-state" id="norms-empty">Pseudo-query norm evolution will appear here during training</div>
              <canvas id="canvas-norms" width="800" height="200" style="display:none" aria-label="Multi-line chart of pseudo-query norm evolution per sublayer"></canvas>
            </div>
            <div class="result-card-footer">
              The magnitude of each pseudo-query grows from zero during training.
      <div class="comparison-grid">
        <div class="comparison-card">
          <h3>Standard Residual</h3>
          <div class="equation" role="math">h = h<sub>l</sub> + F(h<sub>l</sub>)</div>
          <canvas id="canvas-cmp-standard" width="300" height="200" aria-label="Bar chart showing uniform 0.25 weights for standard residual"></canvas>
          <ul>
            <li>Fixed weight = 1 per layer</li>
            <li>No selectivity over depth</li>
        </div>
        <div class="comparison-card comparison-card-highlight">
          <h3>Attention Residual</h3>
          <div class="equation" role="math">h = ∑ α<sub>i</sub> · b<sub>i</sub></div>
          <canvas id="canvas-cmp-attnres" width="300" height="200" aria-label="Bar chart showing learned non-uniform weights for attention residual"></canvas>
          <ul>
            <li>Learned weights via softmax</li>
            <li>Selective routing over depth</li>
      </div>
    </footer>

    <!-- Toast notification container -->
    <div class="toast-container" id="toast-container" aria-live="polite" aria-atomic="true"></div>

    <script type="module" src="/src/main.ts"></script>
  </body>
</html>