Reinforcement Learning
How agents learn through trial and error — rewards, policies, Q-learning, and real-world applications
What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning (which needs labeled data), RL learns from rewards and penalties received after taking actions — much like how a child learns to walk through trial and error.
🎮 The Core Loop:
Agent observes state → takes action → receives reward → transitions to new state → repeat. The goal: maximize cumulative reward over time.
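This loop can be sketched in a few lines of JavaScript. The `ToyEnv` below is a made-up, single-state environment just to show the shape of the interaction; its `reset()`/`step()` interface is illustrative, not a standard API:

```javascript
// A toy environment: 5 steps per episode, action 1 earns a reward of 1.
class ToyEnv {
  constructor() { this.steps = 0; }
  reset() { this.steps = 0; return 0; }  // returns the initial state
  step(action) {
    this.steps += 1;
    const reward = action === 1 ? 1 : 0;
    return { nextState: 0, reward, done: this.steps >= 5 };
  }
}

// The core loop: observe → act → receive reward → transition → repeat.
const env = new ToyEnv();
let state = env.reset();
let cumulativeReward = 0;
let done = false;
while (!done) {
  const action = 1;                  // a fixed "policy" just for this sketch
  const out = env.step(action);
  cumulativeReward += out.reward;    // the quantity the agent tries to maximize
  state = out.nextState;
  done = out.done;
}
```

A real agent would choose `action` from a learned policy instead of hard-coding it; everything else about the loop stays the same.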
Key Concepts
🤖 Agent & Environment
The agent is the learner/decision-maker. The environment is everything the agent interacts with.
Example:
A robot (agent) navigating a maze (environment). Each step is an action; reaching the exit earns a reward.
🎯 Policy (π)
A strategy that maps states to actions. The agent's "playbook" for what to do in each situation.
Types:
Deterministic: always the same action for a given state. Stochastic: a probability distribution over actions.
💰 Reward & Return
Immediate feedback signal. The return is the cumulative (often discounted) reward over an episode.
Discount factor (γ):
γ ∈ [0,1] controls how much future rewards matter. γ=0: greedy. γ=0.99: far-sighted.
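The discounted return can be computed by working backwards through a reward sequence, since G_t = r_t + γ·G_{t+1}. A small sketch (the reward values are made up for illustration):

```javascript
// Discounted return G = Σ γ^t * r_t, computed backwards through the episode.
function discountedReturn(rewards, gamma) {
  let g = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;  // G_t = r_t + γ * G_{t+1}
  }
  return g;
}

discountedReturn([1, 1, 1], 0);  // → 1 (greedy: only the immediate reward counts)
discountedReturn([1, 1, 1], 1);  // → 3 (undiscounted sum of all rewards)
```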
📊 Value Function
Predicts how good a state (or state-action pair) is in terms of expected future reward.
V(s) vs Q(s,a):
V(s): value of being in state s. Q(s,a): value of taking action a in state s.
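The two are directly related: under a greedy policy, V(s) is the maximum of Q(s, a) over actions. A quick sketch with invented Q-values:

```javascript
// Q-table for two states and three actions (values invented for illustration).
const Q = [
  [0.2, 0.8, 0.5],  // Q(s0, ·)
  [0.1, 0.3, 0.9],  // Q(s1, ·)
];

// Under a greedy policy: V(s) = max over a of Q(s, a).
const V = Q.map(row => Math.max(...row));  // [0.8, 0.9]
```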
Exploration vs Exploitation
This is the fundamental dilemma in RL:
🔍 Exploration
Try new, potentially better actions to discover more about the environment.
⚡ Exploitation
Use current knowledge to maximize reward by picking the best known action.
ε-Greedy Strategy:
With probability ε, explore (take a random action). With probability 1-ε, exploit (take the best known action). Start with a high ε and decay it over time.
Q-Learning Algorithm
One of the most foundational RL algorithms — it learns the Q-value function directly from experience, without needing a model of the environment (it is model-free):
// Q-Learning implementation (tabular)
class QLearning {
  constructor(numStates, numActions, learningRate = 0.1, discount = 0.99, epsilon = 1.0) {
    this.qTable = Array.from({ length: numStates }, () =>
      new Array(numActions).fill(0)
    );
    this.lr = learningRate;
    this.gamma = discount;
    this.epsilon = epsilon;
    this.epsilonDecay = 0.995;
    this.epsilonMin = 0.01;
  }

  // Choose action using epsilon-greedy policy
  chooseAction(state) {
    if (Math.random() < this.epsilon) {
      // Explore: random action
      return Math.floor(Math.random() * this.qTable[0].length);
    }
    // Exploit: best known action
    return this.qTable[state].indexOf(Math.max(...this.qTable[state]));
  }

  // Update Q-value after taking action
  update(state, action, reward, nextState) {
    const currentQ = this.qTable[state][action];
    const maxNextQ = Math.max(...this.qTable[nextState]);
    // Q-learning update rule (from the Bellman optimality equation)
    this.qTable[state][action] = currentQ +
      this.lr * (reward + this.gamma * maxNextQ - currentQ);
    // Decay exploration rate
    this.epsilon = Math.max(this.epsilonMin, this.epsilon * this.epsilonDecay);
  }
}

// Training loop (assumes an `env` exposing reset() and step() as in the loop above)
const agent = new QLearning(100, 4); // 100 states, 4 actions (up/down/left/right)
for (let episode = 0; episode < 10000; episode++) {
  let state = env.reset();
  let totalReward = 0;
  let done = false;
  while (!done) {
    const action = agent.chooseAction(state);
    const { nextState, reward, done: isDone } = env.step(action);
    agent.update(state, action, reward, nextState);
    state = nextState;
    totalReward += reward;
    done = isDone;
  }
}
Deep Reinforcement Learning
When state spaces are too large for Q-tables (like images), we use neural networks to approximate the value function:
DQN (Deep Q-Network)
Replaces Q-table with a neural network. Uses experience replay (stores past transitions and samples mini-batches) and a target network (separate network updated slowly for stability).
🏆 DeepMind used DQN to play Atari games at superhuman level (2013)
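The experience replay idea can be sketched as a simple circular buffer. This is a minimal sketch, not DQN itself: the capacity, batch size, and transition shape are illustrative choices.

```javascript
// Replay buffer: stores past transitions and samples them uniformly at
// random, breaking the temporal correlation between consecutive updates.
class ReplayBuffer {
  constructor(capacity = 10000) {
    this.capacity = capacity;
    this.buffer = [];
    this.position = 0;
  }

  // transition: { state, action, reward, nextState, done }
  push(transition) {
    if (this.buffer.length < this.capacity) {
      this.buffer.push(transition);
    } else {
      this.buffer[this.position] = transition;  // overwrite the oldest entry
    }
    this.position = (this.position + 1) % this.capacity;
  }

  // Uniform random mini-batch (sampled with replacement for simplicity).
  sample(batchSize) {
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      const idx = Math.floor(Math.random() * this.buffer.length);
      batch.push(this.buffer[idx]);
    }
    return batch;
  }
}
```

In full DQN, each training step would sample a mini-batch from this buffer and fit the Q-network toward targets computed with the slowly-updated target network.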
Policy Gradient Methods
Instead of learning a value function, directly optimize the policy. Uses gradient ascent on expected reward. Better for continuous action spaces.
Variants: REINFORCE, PPO (used to train ChatGPT via RLHF), A2C/A3C
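The core policy-gradient update can be sketched on a 2-armed bandit with a softmax policy over two logits: the update nudges the logits in the direction of ∇ log π(a) scaled by the reward, as in REINFORCE. The learning rate and reward values here are illustrative.

```javascript
// Softmax turns logits into a probability distribution over actions.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// One REINFORCE step: logit_i += lr * reward * (1{i == action} - π(i)),
// which is gradient ascent on reward * log π(action).
function reinforceUpdate(logits, action, reward, lr = 0.1) {
  const probs = softmax(logits);
  return logits.map((l, i) =>
    l + lr * reward * ((i === action ? 1 : 0) - probs[i])
  );
}

// If action 1 yields reward +1, its probability should increase:
let logits = [0, 0];
logits = reinforceUpdate(logits, 1, 1);
// softmax(logits)[1] is now above 0.5
```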
Actor-Critic Methods
Combine value-based and policy-based approaches. The actor selects actions, the critic evaluates them. Best of both worlds.
PPO (Proximal Policy Optimization) is the industry standard for RLHF in LLMs
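PPO's key ingredient is a clipped surrogate objective that prevents the new policy from moving too far from the old one in a single update. A per-sample sketch (ε = 0.2 is the commonly used default; the input values are illustrative):

```javascript
// Clipped surrogate: L = min(r * A, clip(r, 1-ε, 1+ε) * A),
// where r = π_new(a|s) / π_old(a|s) and A is the advantage estimate.
function ppoClippedObjective(ratio, advantage, epsilon = 0.2) {
  const clipped = Math.min(Math.max(ratio, 1 - epsilon), 1 + epsilon);
  return Math.min(ratio * advantage, clipped * advantage);
}

ppoClippedObjective(1.5, 2.0);  // → 2.4: the ratio is clipped to 1.2
ppoClippedObjective(0.9, 2.0);  // → 1.8: inside the clip range, unchanged
```

Taking the min means the objective stops rewarding policy changes beyond the clip range, which is what keeps PPO updates stable.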
RLHF: Reinforcement Learning from Human Feedback
The technique that made ChatGPT possible:
- Step 1 — Supervised Fine-Tuning: Train a base LLM on curated instruction-following data
- Step 2 — Reward Model: Humans rank multiple LLM outputs. A reward model is trained to predict human preferences
- Step 3 — PPO Optimization: Use the reward model as the reward signal in PPO to fine-tune the LLM to generate outputs humans prefer
This is how models learn to be helpful, harmless, and honest — beyond just predicting the next token.
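Step 2 typically trains the reward model with a pairwise preference loss of the form -log σ(r_chosen - r_rejected), a Bradley-Terry-style objective. A minimal sketch (the reward scores are illustrative):

```javascript
const sigmoid = x => 1 / (1 + Math.exp(-x));

// Pairwise preference loss: low when the model scores the human-preferred
// output well above the rejected one, high when it gets the ranking wrong.
function preferenceLoss(rewardChosen, rewardRejected) {
  return -Math.log(sigmoid(rewardChosen - rewardRejected));
}

// The loss shrinks as the margin between chosen and rejected grows:
preferenceLoss(2.0, 0.0) < preferenceLoss(0.5, 0.0);  // true
```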
Real-World RL Applications
- Game playing: DeepMind's AlphaGo and AlphaZero reached superhuman play in Go and chess through self-play
- Robotics: learning locomotion and manipulation policies through trial and error
- Recommendation systems: optimizing for long-term user engagement rather than single clicks
- Industrial control: DeepMind's RL system cut cooling energy use in Google data centers
🔑 Key Takeaways
- RL learns by interacting with environments, not from labeled datasets
- The exploration/exploitation tradeoff is fundamental to all RL
- Q-learning is the foundation; deep RL scales it to complex problems
- RLHF is the bridge between raw LLMs and helpful AI assistants
- PPO is the workhorse algorithm behind modern AI alignment