Saguaro
Re: Implementation Series

Author
  • Shuqi Wang

Welcome to my open notebook.

I've found that the best way to truly understand a concept is to build it from scratch. This series documents my journey as I re-implement influential AI research papers and models, line by line.

The goal isn't to create production-ready software, but to build a strong intuition for how these architectures actually work. I'm keeping my code minimal and pedagogical, focusing on clarity over optimization.


Episode 01: Decoder-Only GPT

We start with the foundation of modern LLMs: the decoder-only transformer. Following Andrej Karpathy's nanoGPT, I build a character-level language model from the ground up, implementing self-attention, multi-head attention, and the standard transformer block in PyTorch.

Open In Colab
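The core of the decoder-only block is causal self-attention. Here is a minimal sketch of a single head in PyTorch; to keep it short it skips the learned query/key/value projections (a real head would use `nn.Linear` layers), so the tensor names here are illustrative:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Single-head causal self-attention sketch.

    x: (batch, time, channels) tensor of token embeddings.
    In a real model q, k, v come from learned linear projections;
    here we reuse x directly to keep the example short.
    """
    B, T, C = x.shape
    q, k, v = x, x, x
    # Scaled dot-product scores: (B, T, T)
    scores = q @ k.transpose(-2, -1) / (C ** 0.5)
    # Causal mask: position t may only attend to positions <= t.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v                   # (B, T, C)

out = causal_self_attention(torch.randn(2, 5, 8))
print(out.shape)  # torch.Size([2, 5, 8])
```

Because the first token can only attend to itself, its output is exactly its own value vector; multi-head attention runs several of these heads in parallel on split channels and concatenates the results.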

Episode 02: Mixture of Experts (MoE)

Building on the basic GPT, I implement a sparse Mixture of Experts architecture. I explore how to scale model capacity without a proportional increase in inference cost by using conditional computation, noisy top-k gating, and an auxiliary load-balancing loss.

Open In Colab
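The routing step can be sketched as follows. This is a simplified take on noisy top-k gating in the spirit of Shazeer et al. (2017): each token gets a learned, input-dependent noise term added to its gating logits, only the top-k experts keep nonzero weight, and a load-balancing term penalizes uneven expert usage. The function and tensor names are my own, and a full implementation would also dispatch tokens to the expert networks:

```python
import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_gate, w_noise, k=2):
    """Noisy top-k gating sketch (names illustrative).

    x: (tokens, d_model); w_gate, w_noise: (d_model, n_experts).
    Returns sparse gate weights (tokens, n_experts) and an
    auxiliary load-balancing loss.
    """
    clean_logits = x @ w_gate
    # Learned noise encourages exploration across experts during training.
    noise_std = F.softplus(x @ w_noise)
    logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    # Keep only the top-k logits per token; mask the rest to -inf
    # so softmax assigns them exactly zero weight.
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    sparse = torch.full_like(logits, float("-inf"))
    sparse.scatter_(-1, topk_idx, topk_vals)
    gates = F.softmax(sparse, dim=-1)
    # Auxiliary loss: squared coefficient of variation of per-expert
    # importance, pushing routing toward uniform load.
    importance = gates.sum(dim=0)
    aux_loss = importance.var() / (importance.mean() ** 2 + 1e-9)
    return gates, aux_loss

gates, aux = noisy_topk_gate(torch.randn(16, 32),
                             torch.randn(32, 4), torch.randn(32, 4))
print(gates.shape)  # torch.Size([16, 4])
```

During training, `aux_loss` is scaled by a small coefficient and added to the language-modeling loss; without it, gating tends to collapse onto a few favored experts.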


Upcoming

I'm currently exploring Vision Transformers (ViT) and Diffusion Models. More notes will be added here as I finish them.

If you find these notes helpful or spot any errors in my implementation, feel free to reach out. We learn better together.

Thanks for reading. Stay curious!