Gradient Descent
Gradient descent is the engine of machine learning. A model “learns” by repeatedly nudging its parameters in the direction that reduces error.
The update rule
Given parameters θ and a loss function L(θ) that measures how wrong the model is:
θ ← θ − η · ∇L(θ)
- ∇L(θ) is the gradient: the direction of steepest increase in loss.
- We step in the opposite direction (the minus sign) to decrease loss.
- η (eta) is the learning rate: how big each step is.
Repeat until the loss stops improving. That’s it. Everything else — momentum, Adam, schedulers — is a refinement of this loop.
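As a minimal sketch of that loop, the snippet below applies the update rule to a toy one-parameter loss. The loss (θ − 3)², the starting point, the stopping tolerance, and the function names are all assumptions chosen for illustration, not part of any particular library.

```python
def loss(theta):
    # Toy loss with its minimum at theta = 3 (assumed for illustration).
    return (theta - 3.0) ** 2


def grad(theta):
    # Analytic derivative of the toy loss: dL/dtheta = 2 * (theta - 3).
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, eta=0.1, max_steps=1000, tol=1e-12):
    """Apply theta <- theta - eta * grad(theta) until the loss stops improving."""
    prev = loss(theta)
    for _ in range(max_steps):
        theta = theta - eta * grad(theta)  # the update rule
        current = loss(theta)
        if prev - current < tol:  # improvement has become negligible
            break
        prev = current
    return theta


theta_star = gradient_descent(theta=0.0)
print(theta_star)  # ends up very close to 3.0
```

Real models have millions of parameters and the gradient comes from automatic differentiation rather than a hand-written derivative, but the loop has the same shape.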
The intuition
Picture the loss as a hilly landscape and the model as a ball. The gradient tells you which way is uphill; you roll downhill. Too large a learning rate and the ball overshoots and bounces around; too small and it crawls.
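The effect of the learning rate can be checked on a deliberately simple case. The sketch below assumes the toy loss L(θ) = θ², whose gradient is 2θ, so each update multiplies θ by (1 − 2η); the three η values and the helper name `run` are illustrative choices.

```python
def run(eta, steps=50, theta=1.0):
    # Toy loss L(theta) = theta**2, so the gradient is 2 * theta.
    for _ in range(steps):
        theta = theta - eta * 2.0 * theta
    return theta


too_large = run(eta=1.1)    # factor -1.2 per step: overshoots, and the swings grow
too_small = run(eta=0.001)  # factor 0.998 per step: barely moves in 50 steps
moderate = run(eta=0.1)     # factor 0.8 per step: shrinks steadily toward 0

print(too_large, too_small, moderate)
```

On this loss any η above 1 diverges, because |1 − 2η| exceeds 1. For real networks the threshold depends on the local curvature and is not known in advance, which is one reason learning rates are tuned or scheduled.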
Where it gets interesting
- Stochastic gradient descent estimates the gradient from small batches, accepting a noisier estimate in exchange for cheaper, faster steps; that noise often helps generalization.
- The gradient itself is computed by backpropagation, the chain rule applied across the network’s layers.
- The shape of L is set by the choice of loss; see Cross-Entropy Loss.
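The first two points can be sketched together on a one-layer model. The example below assumes synthetic data drawn from y = 2x + 1, a squared-error loss, and hypothetical helper names (`batch_gradient`, `sgd`). The gradient is written out by hand with the chain rule, which is what backpropagation automates for deeper networks.

```python
import random

random.seed(0)

# Synthetic data from y = 2x + 1 plus a little noise (assumed for illustration).
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.05)) for x in xs]


def batch_gradient(w, b, batch):
    """Mean gradient of the squared error over one mini-batch.

    Chain rule: with error e = (w * x + b) - y and per-example loss e**2,
    dL/dw = 2 * e * x and dL/db = 2 * e.
    """
    gw = gb = 0.0
    for x, y in batch:
        e = (w * x + b) - y
        gw += 2.0 * e * x
        gb += 2.0 * e
    n = len(batch)
    return gw / n, gb / n


def sgd(data, eta=0.1, epochs=50, batch_size=16):
    data = list(data)  # copy so shuffling does not reorder the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)  # fresh random batches every epoch
        for i in range(0, len(data), batch_size):
            gw, gb = batch_gradient(w, b, data[i:i + batch_size])
            w -= eta * gw  # same update rule, using a noisy gradient estimate
            b -= eta * gb
    return w, b


w, b = sgd(data)
print(w, b)  # should land near the true values 2 and 1
```

Each step looks at only 16 of the 200 examples, so individual gradients are noisy, yet the parameters still settle near the values that generated the data.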
The human parallel: a learning rate that’s too high looks like cramming (big, unstable jumps); spaced, moderate steps converge more reliably — a theme in Spaced Repetition Meets Curriculum Learning.