Building upon our basic GPT, we now implement a Sparse Mixture of Experts (MoE) architecture. This allows us to scale up model capacity (parameters) without proportionally increasing computational cost (FLOPs) during inference.
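To make the capacity/FLOPs trade-off concrete, here is a minimal, dependency-free sketch of top-k expert routing — the core mechanism of a sparse MoE layer. All names (`moe_forward`, the toy experts, the gate weights) are hypothetical illustrations, not the notebook's actual implementation; a real layer would use learned gate and expert networks.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, experts, gate_weights, top_k=2):
    """Route one token through the top_k highest-scoring experts.

    Only the selected experts run, so per-token FLOPs stay roughly
    constant even as the total number of experts (parameters) grows.
    """
    # Router: a linear gate scores each expert for this token.
    scores = [sum(w * x for w, x in zip(wv, token)) for wv in gate_weights]
    # Keep only the top_k experts (sparse activation).
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    probs = softmax([scores[i] for i in chosen])
    # Weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for p, i in zip(probs, chosen):
        y = experts[i](token)
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, chosen

# Toy experts: elementwise scalings standing in for per-expert FFN blocks.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.2, 0.2]]
out, chosen = moe_forward([1.0, 1.0], experts, gate_weights, top_k=2)
```

With four experts but `top_k=2`, each token only pays for two expert forward passes — the remaining parameters sit idle for that token, which is exactly how MoE decouples parameter count from inference cost.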
My open notebook for mastering AI building blocks.
Ep.01 covers the pretraining phase of a decoder-only, character-level GPT, built from first principles.