Gradient Descent — pedagogy.dev

Gradient descent is the engine behind most of machine learning. A model “learns” by repeatedly nudging its parameters in the direction that reduces its error on the training data.

The update rule

Given parameters θ and a loss function L(θ) that measures how wrong the model is:

θ ← θ − η · ∇L(θ)

∇L(θ) is the gradient — the direction of steepest increase in loss.
We step in the opposite direction (the minus sign) to decrease loss.
η (eta) is the learning rate — how big each step is.

Repeat until the loss stops improving. That’s it. Everything else — momentum, Adam, schedulers — is a refinement of this loop.
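
As a minimal sketch of that loop, assume a toy one-parameter loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3). The loss, the starting point, the stopping tolerance, and the names loss, grad, and gradient_descent are illustrative choices, not part of any library:

```python
def loss(theta):
    # Toy loss (an assumption for illustration): minimized at theta = 3.
    return (theta - 3.0) ** 2


def grad(theta):
    # Derivative of the toy loss with respect to theta.
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, lr=0.1, tol=1e-12, max_steps=10_000):
    """Apply theta <- theta - lr * grad(theta) until the loss stops improving."""
    previous = loss(theta)
    for step in range(max_steps):
        theta = theta - lr * grad(theta)   # the update rule
        current = loss(theta)
        if previous - current < tol:       # improvement too small: stop
            break
        previous = current
    return theta, step + 1


theta_final, steps_taken = gradient_descent(theta=0.0)
print(f"theta = {theta_final:.6f} after {steps_taken} steps")
```

Here "stops improving" is taken to mean that one step lowers the loss by less than tol. That is only one of several reasonable stopping rules. A fixed step budget or a check on held-out data would work as well.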

The intuition

Picture the loss as a hilly landscape and the model as a ball. The gradient tells you which way is uphill; you roll downhill. Too large a learning rate and the ball overshoots and bounces around; too small and it crawls.
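
To see the three regimes side by side, here is a small sketch on an assumed toy loss L(θ) = θ², with gradient 2θ. The three learning rates and the helper name run are picked only for illustration:

```python
def run(lr, steps=20, theta=1.0):
    # Toy loss L(theta) = theta**2 (an assumption); its gradient is 2 * theta.
    for _ in range(steps):
        theta = theta - lr * 2.0 * theta
    return theta


too_small = run(lr=0.01)   # crawls: still far from the minimum at 0
moderate = run(lr=0.4)     # closes in on 0 within a few steps
too_large = run(lr=1.1)    # overshoots every step and grows in magnitude

for name, value in [("too small", too_small),
                    ("moderate", moderate),
                    ("too large", too_large)]:
    print(f"lr {name:>9}: theta after 20 steps = {value:.6g}")
```

On this particular loss each update multiplies θ by (1 − 2η), so any η above 1 makes the ball bounce farther away on every step instead of settling. Real loss landscapes are messier, but the same trade-off shows up.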

Where it gets interesting

Stochastic gradient descent estimates the gradient from small random batches, trading some accuracy in each step for speed; the resulting noise often helps generalization.
The gradient itself is computed by backpropagation, the chain rule applied across the network’s layers.
The shape of the landscape L(θ) depends on which loss function you choose (along with the model and the data); see Cross-Entropy Loss.
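
The first two points can be sketched together on a tiny problem. The example below assumes synthetic data drawn from y = 2x + 1 plus noise, a one-layer linear model, and a squared-error loss, none of which come from the text above. The gradient is worked out by hand with the chain rule, which is the same idea backpropagation applies layer by layer in a deeper network:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): y = 2x + 1 plus a little Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)) for x in xs]


def batch_gradient(w, b, batch):
    """Gradient of the mean squared error over one batch, via the chain rule."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        pred = w * x + b            # forward pass
        d_pred = 2.0 * (pred - y)   # dL/dpred for L = (pred - y)**2
        grad_w += d_pred * x        # chain rule: dpred/dw = x
        grad_b += d_pred            # chain rule: dpred/db = 1
    n = len(batch)
    return grad_w / n, grad_b / n


w, b, lr, batch_size = 0.0, 0.0, 0.1, 16
for epoch in range(50):
    random.shuffle(data)  # a fresh random batching every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w, grad_b = batch_gradient(w, b, batch)
        w -= lr * grad_w   # theta <- theta - lr * gradient, per parameter
        b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Each step looks at only 16 of the 200 points, so every gradient is a noisy estimate of the full one, yet w and b should still settle close to the values used to generate the data.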

The human parallel: a learning rate that’s too high looks like cramming (big, unstable jumps); spaced, moderate steps converge more reliably — a theme in Spaced Repetition Meets Curriculum Learning.