burtenshaw committed
Commit 3bc836e · 1 Parent(s): 80b577a

Anton's comments

Files changed (1)
  1. app/src/content/article.mdx +5 -5
app/src/content/article.mdx CHANGED
@@ -1,6 +1,6 @@
  ---
  title: "Porting nanochat to Transformers: an AI modeling history lesson"
- subtitle: "There is a lot t learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
+ subtitle: "There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
  description: "**tldr:** There is a lot t learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
  authors:
  - name: "Ben Burtenshaw"
@@ -137,7 +137,7 @@ If we review a model in `transformers`, we can review both sides and learn from

  ## Why do we need nanochat in transformers?

- It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:
+ It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like [Qwen3](https://huggingface.co/collections/Qwen/qwen3), [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3), [Gemma3](https://huggingface.co/collections/google/gemma-3-release), or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:

  - `transformers` as a single source of truth teaches us about `nanochat`'s lineage.
  - we can use the `nanochat` model in other libraries.
@@ -218,7 +218,7 @@ By returning loss directly when targets are provided, the training loop becomes

  <Sidenote>

- The [BaseModelOutputWithPast](https://huggingface.co/docs/transformers/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) class standardizes model outputs across the ecosystem.
+ The [BaseModelOutputWithPast](https://huggingface.co/docs/transformers/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) class standardizes model outputs across the ecosystem. Base models return raw logits—loss calculation is delegated to wrapper modules like `ForCausalLM`. You'll often see `if labels is not None: loss = self.loss_function(...)` rather than just using `nn.cross_entropy`. This seemingly roundabout approach exists because of potential [gradient accumulation bugs](https://unsloth.ai/blog/gradient) that forced a rethink of how loss is computed depending on the trainer context.

  </Sidenote>
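
The control flow described in the updated sidenote can be sketched in a few lines. This is a minimal illustration rather than the actual `transformers` or nanochat code: the `TinyForCausalLM` wrapper, its dimensions, and its simplified `loss_function` are invented for the example.

```py
import torch.nn as nn
import torch.nn.functional as F


class TinyForCausalLM(nn.Module):
    """Toy *ForCausalLM-style wrapper: the base model yields hidden states,
    logits are always returned, and a loss is computed only when labels are given."""

    def __init__(self, base_model, hidden_size, vocab_size):
        super().__init__()
        self.model = base_model                              # maps ids -> hidden states
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def loss_function(self, logits, labels):
        # Shift so position t predicts token t+1, then flatten for cross entropy.
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
        )

    def forward(self, input_ids, labels=None):
        hidden_states = self.model(input_ids)
        logits = self.lm_head(hidden_states)
        loss = None
        if labels is not None:                               # only compute loss on request
            loss = self.loss_function(logits, labels)
        return {"loss": loss, "logits": logits}
```

The real library routes this through a shared loss helper so trainers can account for gradient accumulation, which is what the sidenote's linked post discusses; the sketch only keeps the branching.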
 
@@ -471,7 +471,7 @@ class GPT(nn.Module):

  </Sidenote>

- In the modular implementation, we inherit from `Gemma2ForCausalLM`. This is a powerful simplification—Gemma 2 also supports untied weights and advanced output structures. By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied:
+ In the modular implementation, we inherit from [`Gemma2ForCausalLM`](https://huggingface.co/docs/transformers/en/model_doc/gemma2). Gemma 2 also used untied weights and advanced output structures. By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied. Though Gemma 2 ties weights by default, we inherit primarily for code structure alignment and softcapping support—the `tie_word_embeddings` config flag controls the behavior, with `_tied_weights_keys` defining the mapping if applied:

  ```py
  class NanoChatForCausalLM(Gemma2ForCausalLM):
@@ -526,7 +526,7 @@ The [GQA paper](https://arxiv.org/abs/2305.13245) explains how grouped-query att

  </Sidenote>

- NanoChat uses Multi-Query Attention (MQA) to reduce the memory footprint of the KV cache, using 10 query heads but only 4 key/value heads (in the default config).
+ NanoChat uses Multi-Query Attention (MQA) to reduce the memory footprint of the KV cache, using 6 query heads but only 6 key/value heads (in the default config). This is a common configuration for smaller models like nanochat.

  <Sidenote>
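
To make the KV-cache argument in the hunk above concrete, here is a rough back-of-the-envelope helper. The layer count, head dimension, and sequence length are invented for illustration and are not nanochat's actual configuration.

```py
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, dtype_bytes=2):
    """Approximate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes


# Hypothetical 20-layer model with 10 attention heads of dimension 128, fp16 cache.
full_mha = kv_cache_bytes(n_layers=20, n_kv_heads=10, head_dim=128, seq_len=2048)
mqa = kv_cache_bytes(n_layers=20, n_kv_heads=1, head_dim=128, seq_len=2048)
print(f"MHA cache: {full_mha / 1e6:.1f} MB vs MQA cache: {mqa / 1e6:.1f} MB")  # ~10x smaller
```

Grouped-query attention, which the GQA reference in the hunk header covers, sits in between by keeping a small number of key/value heads rather than just one.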
 
 
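
One more illustration, this time for the hunk that rewrites the paragraph introducing `NanoChatForCausalLM`: the new text leans on the `tie_word_embeddings` flag, so here is a separate minimal sketch of what that flag controls. It builds the parent class `Gemma2ForCausalLM` from a deliberately tiny, made-up configuration and is not the NanoChat port itself.

```py
from transformers import Gemma2Config, Gemma2ForCausalLM

# Tiny invented dimensions so the model is cheap to build; only the tying flag matters here.
config = Gemma2Config(
    vocab_size=512,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    head_dim=16,
    tie_word_embeddings=False,  # keep the token embedding and lm_head as separate weights
)
model = Gemma2ForCausalLM(config)

embedding_weight = model.get_input_embeddings().weight
lm_head_weight = model.get_output_embeddings().weight
print(embedding_weight is lm_head_weight)  # False when untied; True with tie_word_embeddings=True
```

Flipping the flag to `True` makes the two tensors share storage; `_tied_weights_keys`, mentioned in the new paragraph, is how the class records that mapping when tying is applied.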