Re: Implementation Series
Author: Shuqi Wang
Welcome to my open notebook.
I've found that the best way to truly understand a concept is to build it from scratch. This series documents my journey as I re-implement influential AI research papers and models, line by line.
The goal isn't to create production-ready software, but to build a strong intuition for how these architectures actually work. I'm keeping my code minimal and pedagogical, focusing on clarity over optimization.
Episode 01: Decoder-Only GPT
We start with the foundation of modern LLMs: the decoder-only transformer. Following Andrej Karpathy's nanoGPT, I build a character-level language model from the ground up, implementing self-attention, multi-head attention, and the standard transformer block in PyTorch.
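The core of that episode is the masked (causal) self-attention head. As a rough sketch of the nanoGPT-style version, one head might look like this (the class name and hyperparameters here are illustrative, not taken from my actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention, nanoGPT-style sketch."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled dot-product attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)
```

Multi-head attention then just runs several of these heads in parallel and concatenates their outputs before a final linear projection.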
Episode 02: Mixture of Experts (MoE)
Building on the basic GPT, I implement a sparse Mixture of Experts architecture. I explore how to scale model capacity without exploding inference costs by using conditional computation, noisy top-k gating, and an auxiliary loss for load balancing.
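The gating idea can be sketched in a few lines. This is a simplified illustration, not my exact implementation: the noise scheme follows Shazeer et al.'s noisy top-k gating, and the auxiliary loss shown here is a simple squared-coefficient-of-variation penalty on per-expert importance, which is one common choice among several:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating sketch for a sparse MoE layer (illustrative)."""

    def __init__(self, n_embd, num_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(n_embd, num_experts, bias=False)
        self.noise = nn.Linear(n_embd, num_experts, bias=False)

    def forward(self, x):
        # x: (tokens, n_embd)
        logits = self.gate(x)
        if self.training:
            # learned Gaussian noise encourages exploration across experts
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        # keep only the top-k logits; the rest become -inf, so their
        # softmax weight is exactly zero (sparse routing)
        masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_vals)
        weights = F.softmax(masked, dim=-1)  # (tokens, num_experts)
        # load-balancing auxiliary loss: penalize imbalance in how much
        # total gate weight ("importance") each expert receives
        importance = weights.sum(0)
        aux_loss = importance.var() / (importance.mean() ** 2 + 1e-10)
        return weights, topk_idx, aux_loss
```

Each token's output is then a weighted sum of only its k selected experts, which is what keeps inference cost roughly constant as the expert count grows.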
Upcoming
I'm currently exploring Vision Transformers (ViT) and Diffusion Models. More notes will be added here as I finish them.
If you find these notes helpful or spot any errors in my implementation, feel free to reach out. We learn better together.