Reinforcement Learning
How agents learn through trial and error — rewards, policies, Q-learning, and real-world applications
What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning (which needs labeled data), RL learns from rewards and penalties received after taking actions — much like how a child learns to walk through trial and error.
🎮 The Core Loop:
Agent observes state → takes action → receives reward → transitions to new state → repeat. The goal: maximize cumulative reward over time.
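This loop can be sketched in a few lines of JavaScript. The `ToyEnv` below is a made-up, single-state environment just to show the shape of the interaction; its `reset()`/`step()` interface is illustrative, not a standard API:

```javascript
// A toy environment: 5 steps per episode, action 1 earns a reward of 1.
class ToyEnv {
  constructor() { this.steps = 0; }
  reset() { this.steps = 0; return 0; }  // returns the initial state
  step(action) {
    this.steps += 1;
    const reward = action === 1 ? 1 : 0;
    return { nextState: 0, reward, done: this.steps >= 5 };
  }
}

// The core loop: observe → act → receive reward → transition → repeat.
const env = new ToyEnv();
let state = env.reset();
let cumulativeReward = 0;
let done = false;
while (!done) {
  const action = 1;                  // a fixed "policy" just for this sketch
  const out = env.step(action);
  cumulativeReward += out.reward;    // the quantity the agent tries to maximize
  state = out.nextState;
  done = out.done;
}
```

A real agent would choose `action` from a learned policy instead of hard-coding it; everything else about the loop stays the same.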
Key Concepts
🤖 Agent & Environment
The agent is the learner/decision-maker. The environment is everything the agent interacts with.
Example:
A robot (agent) navigating a maze (environment). Each step is an action; reaching the exit earns a reward.
🎯 Policy (π)
A strategy that maps states to actions. The agent's "playbook" for what to do in each situation.
Types:
Deterministic: always the same action for a given state. Stochastic: a probability distribution over actions.
💰 Reward & Return
Immediate feedback signal. The return is the cumulative (often discounted) reward over an episode.
Discount factor (γ):
γ ∈ [0,1] controls how much future rewards matter. γ=0: greedy. γ=0.99: far-sighted.
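The discounted return can be computed by working backwards through a reward sequence, since G_t = r_t + γ·G_{t+1}. A small sketch (the reward values are made up for illustration):

```javascript
// Discounted return G = Σ γ^t * r_t, computed backwards through the episode.
function discountedReturn(rewards, gamma) {
  let g = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;  // G_t = r_t + γ * G_{t+1}
  }
  return g;
}

discountedReturn([1, 1, 1], 0);  // → 1 (greedy: only the immediate reward counts)
discountedReturn([1, 1, 1], 1);  // → 3 (undiscounted sum of all rewards)
```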
📊 Value Function
Predicts how good a state (or state-action pair) is in terms of expected future reward.
V(s) vs Q(s,a):
V(s): value of being in state s. Q(s,a): value of taking action a in state s.
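The two are directly related: under a greedy policy, V(s) is the maximum of Q(s, a) over actions. A quick sketch with invented Q-values:

```javascript
// Q-table for two states and three actions (values invented for illustration).
const Q = [
  [0.2, 0.8, 0.5],  // Q(s0, ·)
  [0.1, 0.3, 0.9],  // Q(s1, ·)
];

// Under a greedy policy: V(s) = max over a of Q(s, a).
const V = Q.map(row => Math.max(...row));  // [0.8, 0.9]
```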
Exploration vs Exploitation
This is the fundamental dilemma in RL:
🔍 Exploration
Try new, potentially better actions to discover more about the environment.
⚡ Exploitation
Use current knowledge to maximize reward by picking the best known action.
ε-Greedy Strategy:
With probability ε, explore (take a random action). With probability 1-ε, exploit (take the best known action). Start with a high ε and decay it over time.
Q-Learning Algorithm
One of the most foundational RL algorithms — it learns the Q-value function directly from experience, without needing a model of the environment (it is model-free):
// Q-Learning implementation (tabular)
class QLearning {
  constructor(numStates, numActions, learningRate = 0.1, discount = 0.99, epsilon = 1.0) {
    this.qTable = Array.from({ length: numStates }, () =>
      new Array(numActions).fill(0)
    );
    this.lr = learningRate;
    this.gamma = discount;
    this.epsilon = epsilon;
    this.epsilonDecay = 0.995;
    this.epsilonMin = 0.01;
  }

  // Choose action using epsilon-greedy policy
  chooseAction(state) {
    if (Math.random() < this.epsilon) {
      // Explore: random action
      return Math.floor(Math.random() * this.qTable[0].length);
    }
    // Exploit: best known action
    return this.qTable[state].indexOf(Math.max(...this.qTable[state]));
  }

  // Update Q-value after taking action
  update(state, action, reward, nextState) {
    const currentQ = this.qTable[state][action];
    const maxNextQ = Math.max(...this.qTable[nextState]);
    // Q-learning update rule (from the Bellman optimality equation)
    this.qTable[state][action] = currentQ +
      this.lr * (reward + this.gamma * maxNextQ - currentQ);
    // Decay exploration rate
    this.epsilon = Math.max(this.epsilonMin, this.epsilon * this.epsilonDecay);
  }
}

// Training loop (assumes an `env` exposing reset() and step() as in the loop above)
const agent = new QLearning(100, 4); // 100 states, 4 actions (up/down/left/right)
for (let episode = 0; episode < 10000; episode++) {
  let state = env.reset();
  let totalReward = 0;
  let done = false;
  while (!done) {
    const action = agent.chooseAction(state);
    const { nextState, reward, done: isDone } = env.step(action);
    agent.update(state, action, reward, nextState);
    state = nextState;
    totalReward += reward;
    done = isDone;
  }
}
Deep Reinforcement Learning
When state spaces are too large for Q-tables (like images), we use neural networks to approximate the value function:
DQN (Deep Q-Network)
Replaces Q-table with a neural network. Uses experience replay (stores past transitions and samples mini-batches) and a target network (separate network updated slowly for stability).
🏆 DeepMind used DQN to play Atari games at superhuman level (2013)
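The experience replay idea can be sketched as a simple circular buffer. This is a minimal sketch, not DQN itself: the capacity, batch size, and transition shape are illustrative choices.

```javascript
// Replay buffer: stores past transitions and samples them uniformly at
// random, breaking the temporal correlation between consecutive updates.
class ReplayBuffer {
  constructor(capacity = 10000) {
    this.capacity = capacity;
    this.buffer = [];
    this.position = 0;
  }

  // transition: { state, action, reward, nextState, done }
  push(transition) {
    if (this.buffer.length < this.capacity) {
      this.buffer.push(transition);
    } else {
      this.buffer[this.position] = transition;  // overwrite the oldest entry
    }
    this.position = (this.position + 1) % this.capacity;
  }

  // Uniform random mini-batch (sampled with replacement for simplicity).
  sample(batchSize) {
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      const idx = Math.floor(Math.random() * this.buffer.length);
      batch.push(this.buffer[idx]);
    }
    return batch;
  }
}
```

In full DQN, each training step would sample a mini-batch from this buffer and fit the Q-network toward targets computed with the slowly-updated target network.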
Policy Gradient Methods
Instead of learning a value function, directly optimize the policy. Uses gradient ascent on expected reward. Better for continuous action spaces.
Variants: REINFORCE, PPO (used to train ChatGPT via RLHF), A2C/A3C
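The core policy-gradient update can be sketched on a 2-armed bandit with a softmax policy over two logits: the update nudges the logits in the direction of ∇ log π(a) scaled by the reward, as in REINFORCE. The learning rate and reward values here are illustrative.

```javascript
// Softmax turns logits into a probability distribution over actions.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// One REINFORCE step: logit_i += lr * reward * (1{i == action} - π(i)),
// which is gradient ascent on reward * log π(action).
function reinforceUpdate(logits, action, reward, lr = 0.1) {
  const probs = softmax(logits);
  return logits.map((l, i) =>
    l + lr * reward * ((i === action ? 1 : 0) - probs[i])
  );
}

// If action 1 yields reward +1, its probability should increase:
let logits = [0, 0];
logits = reinforceUpdate(logits, 1, 1);
// softmax(logits)[1] is now above 0.5
```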
Actor-Critic Methods
Combine value-based and policy-based approaches. The actor selects actions, the critic evaluates them. Best of both worlds.
PPO (Proximal Policy Optimization) is the industry standard for RLHF in LLMs
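PPO's key ingredient is a clipped surrogate objective that prevents the new policy from moving too far from the old one in a single update. A per-sample sketch (ε = 0.2 is the commonly used default; the input values are illustrative):

```javascript
// Clipped surrogate: L = min(r * A, clip(r, 1-ε, 1+ε) * A),
// where r = π_new(a|s) / π_old(a|s) and A is the advantage estimate.
function ppoClippedObjective(ratio, advantage, epsilon = 0.2) {
  const clipped = Math.min(Math.max(ratio, 1 - epsilon), 1 + epsilon);
  return Math.min(ratio * advantage, clipped * advantage);
}

ppoClippedObjective(1.5, 2.0);  // → 2.4: the ratio is clipped to 1.2
ppoClippedObjective(0.9, 2.0);  // → 1.8: inside the clip range, unchanged
```

Taking the min means the objective stops rewarding policy changes beyond the clip range, which is what keeps PPO updates stable.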
RLHF: Reinforcement Learning from Human Feedback
The technique that made ChatGPT possible:
- Step 1 — Supervised Fine-Tuning: Train a base LLM on curated instruction-following data
- Step 2 — Reward Model: Humans rank multiple LLM outputs. A reward model is trained to predict human preferences
- Step 3 — PPO Optimization: Use the reward model as the reward signal in PPO to fine-tune the LLM to generate outputs humans prefer
This is how models learn to be helpful, harmless, and honest — beyond just predicting the next token.
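Step 2 typically trains the reward model with a pairwise preference loss of the form -log σ(r_chosen - r_rejected), a Bradley-Terry-style objective. A minimal sketch (the reward scores are illustrative):

```javascript
const sigmoid = x => 1 / (1 + Math.exp(-x));

// Pairwise preference loss: low when the model scores the human-preferred
// output well above the rejected one, high when it gets the ranking wrong.
function preferenceLoss(rewardChosen, rewardRejected) {
  return -Math.log(sigmoid(rewardChosen - rewardRejected));
}

// The loss shrinks as the margin between chosen and rejected grows:
preferenceLoss(2.0, 0.0) < preferenceLoss(0.5, 0.0);  // true
```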
Real-World RL Applications
- Game playing: DeepMind's AlphaGo and AlphaZero reached superhuman play in Go and chess through self-play
- Robotics: learning locomotion and manipulation policies through trial and error
- Recommendation systems: optimizing for long-term user engagement rather than single clicks
- Industrial control: DeepMind's RL system cut cooling energy use in Google data centers
🔑 Key Takeaways
- RL learns by interacting with environments, not from labeled datasets
- The exploration/exploitation tradeoff is fundamental to all RL
- Q-learning is the foundation; deep RL scales it to complex problems
- RLHF is the bridge between raw LLMs and helpful AI assistants
- PPO is the workhorse algorithm behind modern AI alignment