burtenshaw committed
Commit 3bc836e · 1 Parent(s): 80b577a
Anton's comments
app/src/content/article.mdx CHANGED
@@ -1,6 +1,6 @@
 ---
 title: "Porting nanochat to Transformers: an AI modeling history lesson"
-subtitle: "There is a lot
+subtitle: "There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
 description: "**tldr:** There is a lot t learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
 authors:
   - name: "Ben Burtenshaw"
@@ -137,7 +137,7 @@ If we review a model in `transformers`, we can review both sides and learn from

 ## Why do we need nanochat in transformers?

-It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:
+It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like [Qwen3](https://huggingface.co/collections/Qwen/qwen3), [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3), [Gemma3](https://huggingface.co/collections/google/gemma-3-release), or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:

 - `transformers` as a single source of truth teaches us about `nanochat`'s lineage.
 - we can use the `nanochat` model in other libraries.
@@ -218,7 +218,7 @@ By returning loss directly when targets are provided, the training loop becomes

 <Sidenote>

-The [BaseModelOutputWithPast](https://huggingface.co/docs/transformers/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) class standardizes model outputs across the ecosystem.
+The [BaseModelOutputWithPast](https://huggingface.co/docs/transformers/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) class standardizes model outputs across the ecosystem. Base models return raw logits—loss calculation is delegated to wrapper modules like `ForCausalLM`. You'll often see `if labels is not None: loss = self.loss_function(...)` rather than just using `nn.cross_entropy`. This seemingly roundabout approach exists because of potential [gradient accumulation bugs](https://unsloth.ai/blog/gradient) that forced a rethink of how loss is computed depending on the trainer context.

 </Sidenote>

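For context on the pattern the expanded sidenote describes, here is a minimal sketch of the labels-gated loss idiom: the wrapper only computes a loss when labels are passed, and returns a structured output rather than a bare tensor. The names (`ToyForCausalLM`, `CausalLMOutput`) and shapes are illustrative, not the actual `transformers` or nanochat classes.

```py
# Minimal sketch of the "compute loss only when labels are provided" pattern.
# Illustrative names and shapes; not the transformers implementation.
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class CausalLMOutput:
    loss: Optional[torch.Tensor]  # None unless labels were provided
    logits: torch.Tensor


class ToyForCausalLM(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = nn.Linear(hidden_size, hidden_size)  # stand-in for the base model
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, labels: Optional[torch.Tensor] = None):
        logits = self.lm_head(self.backbone(hidden_states))
        loss = None
        if labels is not None:
            # Shift so position t predicts token t+1, as in causal LM training.
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            loss = F.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
            )
        return CausalLMOutput(loss=loss, logits=logits)


x = torch.randn(2, 8, 16)
labels = torch.randint(0, 100, (2, 8))
out = ToyForCausalLM(hidden_size=16, vocab_size=100)(x, labels=labels)
print(out.loss is not None, out.logits.shape)  # True torch.Size([2, 8, 100])
```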
@@ -471,7 +471,7 @@ class GPT(nn.Module):

 </Sidenote>

-In the modular implementation, we inherit from `Gemma2ForCausalLM
+In the modular implementation, we inherit from [`Gemma2ForCausalLM`](https://huggingface.co/docs/transformers/en/model_doc/gemma2). Gemma 2 also used untied weights and advanced output structures. By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied. Though Gemma 2 ties weights by default, we inherit primarily for code structure alignment and softcapping support—the `tie_word_embeddings` config flag controls the behavior, with `_tied_weights_keys` defining the mapping if applied:

 ```py
 class NanoChatForCausalLM(Gemma2ForCausalLM):
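The weight-tying behaviour mentioned in the new paragraph can be illustrated with a small sketch, assuming a toy module and made-up sizes (this is not the NanoChat or Gemma 2 code): a single config flag decides whether the output projection shares its parameter matrix with the input embedding.

```py
# Illustrative sketch of tied vs. untied input/output embeddings.
import torch.nn as nn


class ToyCausalLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, tie_word_embeddings: bool):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        if tie_word_embeddings:
            # Tied: lm_head reuses the embedding matrix (one parameter tensor).
            self.lm_head.weight = self.embed_tokens.weight


tied = ToyCausalLM(vocab_size=50304, hidden_size=768, tie_word_embeddings=True)
untied = ToyCausalLM(vocab_size=50304, hidden_size=768, tie_word_embeddings=False)
print(tied.lm_head.weight is tied.embed_tokens.weight)      # True
print(untied.lm_head.weight is untied.embed_tokens.weight)  # False
```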
@@ -526,7 +526,7 @@ The [GQA paper](https://arxiv.org/abs/2305.13245) explains how grouped-query att

 </Sidenote>

-NanoChat uses Multi-Query Attention (MQA) to reduce the memory footprint of the KV cache, using
+NanoChat uses Multi-Query Attention (MQA) to reduce the memory footprint of the KV cache, using 6 query heads but only 6 key/value heads (in the default config). This is a common configuration for smaller models like nanochat.

 <Sidenote>

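The KV-cache saving that this paragraph appeals to can be shown with back-of-the-envelope arithmetic. The sketch below uses made-up dimensions (20 layers, head size 128, 4k context, fp16), not nanochat's actual configuration; the point is only that cache size scales with the number of key/value heads, so sharing KV heads across query heads shrinks it proportionally.

```py
# Rough KV-cache sizing: 2x (keys and values) per layer, per KV head, per position.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


common = dict(layers=20, head_dim=128, seq_len=4096, batch=1)
mha = kv_cache_bytes(kv_heads=8, **common)  # every query head has its own KV head
mqa = kv_cache_bytes(kv_heads=1, **common)  # all query heads share one KV head
print(f"MHA cache: {mha / 1e6:.1f} MB, MQA cache: {mqa / 1e6:.1f} MB")
# MHA cache: 335.5 MB, MQA cache: 41.9 MB
```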