Chapter 8: Reinforcement Learning

Learn how an agent improves through rewards and penalties. Explore robot learning, policies, Q-learning, exploration, exploitation and interactive reward simulation.

Keywords: Agent, Reward, Penalty, Q-Learning, Robot Learning

Agent (acts) → Environment (responds) → Reward (guides) → Policy (improves)

8.1 Chapter Overview

Reinforcement Learning, or RL, is a Machine Learning approach where a model learns by interacting with an environment. The learner, called an agent, takes actions and receives rewards or penalties. Over time, the agent learns which actions lead to better long-term results.

A common example is a robot learning how to move correctly. If the robot moves closer to its target, it receives a positive reward. If it hits a wall or moves into danger, it receives a penalty. Through repeated practice, the robot improves its movement strategy.

Learning Outcome: By the end of this chapter, learners should be able to explain reinforcement learning, identify key RL components, understand rewards and penalties, describe robot learning, and implement simple Q-learning logic using Python.
1. Agent Observes State
2. Agent Takes Action
3. Environment Responds
4. Reward or Penalty
5. Policy Improves
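
The five steps above can be sketched as a loop. This is a minimal sketch under assumptions: the `step` function, the line of positions 0 to 4 and the always-move-right `policy` are hypothetical stand-ins chosen for illustration, and no learning takes place yet.

```python
# Hypothetical toy environment: positions 0..4 on a line, goal at 4.
GOAL = 4

def step(state, action):
    """Environment responds: return (next_state, reward) for an action."""
    move = 1 if action == "right" else -1
    next_state = min(GOAL, max(0, state + move))
    reward = 10 if next_state == GOAL else -1
    return next_state, reward

def policy(state):
    """Placeholder policy: always move right (no learning yet)."""
    return "right"

state = 0            # 1. agent observes the state
total_reward = 0
while state != GOAL:
    action = policy(state)                 # 2. agent takes an action
    state, reward = step(state, action)    # 3. environment responds
    total_reward += reward                 # 4. reward or penalty is received
    # 5. a learning agent would update its policy here

print(total_reward)  # 7  (three steps at -1, then +10 at the goal)
```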

8.2 What is Reinforcement Learning?

Reinforcement Learning is inspired by trial-and-error learning. Instead of learning from labelled answers, the agent learns from feedback. The feedback may be positive, negative or neutral.

Term | Meaning | Example
Agent | The learner or decision maker. | Robot, game player, trading bot
Environment | The world where the agent acts. | Room, game board, road, factory floor
State | The current situation of the agent. | Robot location on a grid
Action | Choice made by the agent. | Move up, down, left or right
Reward | Positive feedback for good action. | +10 for reaching goal
Penalty | Negative feedback for bad action. | -10 for hitting danger
Policy | Strategy used to choose actions. | Move toward highest reward

8.3 Supervised vs Unsupervised vs Reinforcement Learning

Learning Type | How It Learns | Example
Supervised Learning | Learns from labelled examples. | Predict pass/fail from past labelled records
Unsupervised Learning | Finds hidden patterns without labels. | Group customers into segments
Reinforcement Learning | Learns from rewards and penalties. | Robot learns to reach a target
Simple Difference: Supervised learning learns from answer keys, unsupervised learning finds patterns, and reinforcement learning learns from consequences.

8.4 Robot Learning Example

Imagine a robot inside a 5 × 5 grid. The robot starts at one corner and must reach the goal. Some cells may contain danger. The robot can move up, down, left or right.

Robot Action | Feedback | Learning Meaning
Moves closer to goal | Small positive reward | Good direction
Reaches goal | Large positive reward | Best outcome
Hits wall | Penalty | Invalid move
Steps into danger | Large penalty | Avoid this action
Learning Goal: Maximize total future reward
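
The grid described above can be represented with (row, column) coordinates. The sketch below is one possible encoding, not the chapter's simulator: the names `GRID_SIZE`, `MOVES` and `move`, and the rule that the robot stays in place when it hits a wall, are assumptions made for illustration.

```python
GRID_SIZE = 5  # 5 x 5 grid, rows and columns numbered 0..4

# Each action changes (row, column) by a fixed offset.
MOVES = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}

def move(position, action):
    """Return (new_position, hit_wall) after applying an action."""
    row, col = position
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE
    if not inside:
        return position, True   # assumed rule: robot stays put when it hits a wall
    return (new_row, new_col), False

print(move((0, 0), "right"))  # ((0, 1), False)
print(move((0, 0), "up"))     # ((0, 0), True)
```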

8.5 Reward Function

The reward function defines how the environment gives feedback. Good reward design is very important. If rewards are poorly designed, the agent may learn the wrong behavior.

Situation | Reward Value | Reason
Reach goal | +10 | Desired final outcome
Move to normal cell | -1 | Encourages shorter path
Hit wall | -5 | Discourages invalid movement
Enter danger cell | -10 | Strongly discourages unsafe action
Important: Rewards should guide the agent toward useful behavior, not simply make it move randomly.
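
The reward table above can be written as a small function. This is a sketch only: the function name `get_reward` and the coordinates chosen for the goal and danger cells are hypothetical, while the reward values are the ones from the table.

```python
GOAL_CELL = (4, 4)               # assumed goal location
DANGER_CELLS = {(2, 2), (3, 1)}  # assumed danger locations

def get_reward(new_position, hit_wall):
    """Map the outcome of a move to the reward values from the table."""
    if hit_wall:
        return -5
    if new_position == GOAL_CELL:
        return 10
    if new_position in DANGER_CELLS:
        return -10
    return -1   # normal cell: a small cost encourages shorter paths

print(get_reward((4, 4), False))  # 10
print(get_reward((2, 2), False))  # -10
print(get_reward((0, 0), True))   # -5
print(get_reward((0, 1), False))  # -1
```

Changing a single value here changes what the agent is encouraged to do; for example, replacing the -1 step cost with 0 would remove the incentive to take the shortest path.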

8.6 Exploration vs Exploitation

Reinforcement Learning must balance two behaviors:

Exploration

The agent tries new actions to discover better rewards. Example: robot tries a new path.

Exploitation

The agent uses the best-known action. Example: robot follows the path that previously worked.

Good RL = Balance Exploration and Exploitation
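
One common way to balance the two behaviors is an epsilon-greedy rule, which this chapter does not name explicitly: with a small probability epsilon the agent explores, otherwise it exploits. The sketch below assumes that rule; `choose_action`, `epsilon` and the sample values are illustrative names and numbers.

```python
import random

def choose_action(q_values, epsilon):
    """Pick an action index using an epsilon-greedy rule."""
    if random.random() < epsilon:
        # Explore: try any action at random.
        return random.randrange(len(q_values))
    # Exploit: use the action with the highest known value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.5, 2.0, -1.0]   # made-up values for three actions
print(choose_action(q_values, epsilon=0.0))  # 1 (epsilon 0 always exploits)
```

With `epsilon=1.0` the same function always explores, so the value of epsilon sets the balance between the two behaviors.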

8.7 Policy, Value and Q-Value

Concept | Meaning | Example
Policy | Strategy for choosing actions. | If at cell A, move right
Value | How good a state is. | Cells near goal have higher value
Q-Value | How good an action is in a state. | Move right from this cell has value 8

The agent learns Q-values to decide which action is best in each state.
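
The link between the three concepts can be shown with a small table of Q-values. The numbers below are made up for illustration, and the sketch assumes the agent always picks the best-known action, so the value of a state is taken as the largest Q-value in its row.

```python
import numpy as np

# Made-up Q-values: 3 states (rows) x 2 actions (0 = left, 1 = right).
q_table = np.array([
    [1.0, 4.0],
    [2.0, 8.0],
    [6.0, 3.0],
])

state_values = q_table.max(axis=1)      # value of each state under greedy action choice
greedy_policy = q_table.argmax(axis=1)  # index of the best action in each state

print(state_values)   # [4. 8. 6.]
print(greedy_policy)  # [1 1 0]
```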

8.8 Q-Learning

Q-Learning is a popular reinforcement learning algorithm. It learns the value of actions in different states. These values are stored in a Q-table.

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]

Symbol | Meaning
Q(s,a) | Current Q-value for state s and action a
α (alpha) | Learning rate
r | Reward received
γ (gamma) | Discount factor for future rewards
s' | New state after action
max_a' Q(s',a') | Highest Q-value among the actions a' available in the next state s'
Simple Meaning: Q-Learning updates the agent's memory by combining current reward and expected future reward.
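
The update rule can be written as a single function. This is a minimal sketch: the function name `q_update`, its parameter names and the sample numbers are assumptions for illustration, while the arithmetic follows the formula above.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Return the updated Q(s,a) after one Q-learning step."""
    best_future_q = max(next_q_values)       # max over a' of Q(s',a')
    target = reward + gamma * best_future_q  # r + gamma * max Q(s',a')
    return q_sa + alpha * (target - q_sa)    # move Q(s,a) a fraction alpha toward the target

# Made-up numbers: current Q(s,a) = 2.0, reward -1, next state offers Q-values 0.0 and 4.0.
new_q = q_update(q_sa=2.0, reward=-1, next_q_values=[0.0, 4.0])
print(round(new_q, 2))  # 2.06
```

Here the target is -1 + 0.9 × 4.0 = 2.6, so the old value 2.0 moves one tenth of the way toward it.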

8.9 Interactive Robot Reward Simulator

Use the buttons to move the robot. The robot receives rewards or penalties depending on the action. This simple simulator helps learners understand state, action, reward and penalty.

Robot Grid World

Goal: Reach 🏁. Avoid ⚠️. Normal movement gives -1 reward, danger gives -10, goal gives +10.

Interactive Reward Graph

Each bar represents the reward received after an action.

8.10 Python Example: Simple Grid Reward Logic

# Simple reinforcement learning style reward system

robot_position = 0
goal_position = 4
danger_position = 2
total_reward = 0

actions = ["right", "right", "right", "right"]

for action in actions:
    if action == "right":
        robot_position += 1

    if robot_position == goal_position:
        reward = 10
        print("Robot reached the goal.")
    elif robot_position == danger_position:
        reward = -10
        print("Robot entered danger.")
    else:
        reward = -1
        print("Robot moved to position:", robot_position)

    total_reward += reward
    print("Reward:", reward)
    print("Total Reward:", total_reward)
    print("-----")
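Expected Output (condensed to one line per step):
Robot moved to position: 1 | Reward: -1 | Total Reward: -1
Robot entered danger. | Reward: -10 | Total Reward: -11
Robot moved to position: 3 | Reward: -1 | Total Reward: -12
Robot reached the goal. | Reward: 10 | Total Reward: -2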

8.11 Python Example: Q-Table Structure

A Q-table stores values for each state-action pair. Initially, values can start at zero.

import numpy as np

states = 5
actions = 2   # 0 = left, 1 = right

q_table = np.zeros((states, actions))

print(q_table)
Expected Output:
A table of zeros with 5 rows and 2 columns.

Line-by-Line Explanation

Code | Explanation
states = 5 | There are 5 possible positions.
actions = 2 | The robot can move left or right.
np.zeros((states, actions)) | Creates a Q-table with 5 rows and 2 columns, with every value set to zero.
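
Each entry of the Q-table is addressed by a state (row) and an action (column). The short sketch below shows how a value is stored and read back; the value 10.0 and the chosen cell are arbitrary examples.

```python
import numpy as np

q_table = np.zeros((5, 2))   # 5 states, 2 actions (0 = left, 1 = right)

q_table[3, 1] = 10.0         # store a value for "move right from state 3"
print(q_table[3])            # [ 0. 10.]  (both action values for state 3)
print(q_table[3, 1])         # 10.0
print(q_table.shape)         # (5, 2)
```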

8.12 Python Example: Basic Q-Learning Update

old_q_value = 0
learning_rate = 0.1
reward = 10
discount_factor = 0.9
best_future_q = 5

new_q_value = old_q_value + learning_rate * (
    reward + discount_factor * best_future_q - old_q_value
)

print("Updated Q-Value:", round(new_q_value, 2))  # rounded to hide floating-point noise
Expected Output:
Updated Q-Value: 1.45
Explanation: 0 + 0.1 × (10 + 0.9 × 5 - 0) = 0.1 × 14.5 = 1.45. The Q-value increases because the action received a good reward and the next state has good future potential.

8.13 Mini Q-Learning Example: One-Dimensional Robot

This example trains a simple robot to move toward a goal in a one-dimensional line.

import numpy as np
import random

states = 5
actions = 2  # 0 = left, 1 = right

q_table = np.zeros((states, actions))

learning_rate = 0.1
discount_factor = 0.9
episodes = 100

goal_state = 4

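for episode in range(episodes):
    state = 0  # every episode starts at the left end of the line

    while state != goal_state:
        # Pure exploration: choose left or right at random.
        action = random.choice([0, 1])

        # Apply the move, keeping the robot inside positions 0..goal_state.
        if action == 0:
            next_state = max(0, state - 1)
        else:
            next_state = min(goal_state, state + 1)

        # Reward: +10 for reaching the goal, -1 for any other step.
        if next_state == goal_state:
            reward = 10
        else:
            reward = -1

        # Best Q-value available from the next state.
        best_future_q = np.max(q_table[next_state])

        # Q-learning update rule from Section 8.8.
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_factor * best_future_q - q_table[state, action]
        )

        state = next_state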

print("Trained Q-Table:")
print(q_table)
Learning Note: After many episodes, the Q-value for moving right is typically higher than the Q-value for moving left in each state, because moving right reaches the goal and its +10 reward sooner.
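
To check what the robot has learned, the best-known action in each state can be read from the trained Q-table. The sketch below repeats the training loop in compact form so that it runs on its own; the fixed random seed, the 500 episodes and the `names` list are assumptions added for illustration.

```python
import random
import numpy as np

random.seed(0)  # assumed seed so that runs are repeatable

states, actions, goal_state = 5, 2, 4   # 0 = left, 1 = right
q_table = np.zeros((states, actions))
learning_rate, discount_factor = 0.1, 0.9

for episode in range(500):   # assumed episode count, larger than in the example above
    state = 0
    while state != goal_state:
        action = random.choice([0, 1])
        next_state = max(0, state - 1) if action == 0 else min(goal_state, state + 1)
        reward = 10 if next_state == goal_state else -1
        q_table[state, action] += learning_rate * (
            reward + discount_factor * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

# Greedy policy: index of the best action for each non-goal state.
policy = np.argmax(q_table[:goal_state], axis=1)
names = ["left", "right"]
for s, a in enumerate(policy):
    print("State", s, "->", names[a])
```

The goal state is left out of the policy because the episode ends there and its row of the Q-table is never updated.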

8.14 Applications of Reinforcement Learning

Application Area | RL Use
Robotics | Robot movement, grasping objects, navigation
Games | AI players learning strategies
Self-Driving Vehicles | Decision making in traffic environments
Manufacturing | Optimizing machine control and production flow
Finance | Trading strategy optimization
Healthcare | Treatment recommendation strategies

8.15 Common Beginner Mistakes

Mistake | Problem | Correction
Reward is poorly designed | Agent learns wrong behavior | Design rewards carefully
No exploration | Agent may miss better actions | Allow exploration during training
Too much exploration | Agent behaves randomly | Reduce exploration over time
Training too few episodes | Agent may not learn enough | Train over many episodes
Ignoring environment rules | Invalid actions may happen | Define boundaries and penalties
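
The correction "reduce exploration over time" is often implemented by shrinking an exploration probability after every episode. The sketch below shows one such schedule; the names `epsilon`, `min_epsilon` and `decay` and their values are assumptions, and the training episode itself is left as a placeholder comment.

```python
epsilon = 1.0        # start fully exploratory
min_epsilon = 0.05   # always keep a little exploration
decay = 0.9          # multiply epsilon by this after each episode

history = []
for episode in range(50):
    history.append(epsilon)
    # ... one training episode using the current epsilon would run here ...
    epsilon = max(min_epsilon, epsilon * decay)

print(round(history[0], 3), round(history[10], 3), round(history[-1], 3))
# 1.0 0.349 0.05
```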

8.16 Hands-On Activities

Activity 1: Reward Table

Create a reward table for a robot moving in a classroom from door to charging station.

Activity 2: Grid World

Draw a 5 × 5 grid with start, goal and danger cells. Define rewards for each movement.

Activity 3: Q-Table

Create a Q-table with 6 states and 4 actions using NumPy.

Activity 4: Exploration vs Exploitation

Explain why a robot should sometimes try new paths instead of always repeating the same known path.

Mini Project: Robot Path Learner

Create a Python program where a robot learns to move from start to goal using rewards and penalties.

8.17 Interactive Final Assessment Quiz

Each correct answer gives +1 mark. Each wrong answer gives -0.5 mark.

1. Reinforcement Learning learns through rewards and penalties.

2. In RL, the learner is called:

3. The environment gives feedback to the agent.

4. A policy is:

5. Q-Learning uses a Q-table to store state-action values.

6. Exploration means trying new actions.

7. Exploitation means using the best-known action.

8. A robot reaching the goal should normally receive:

9. Poor reward design can cause wrong learning behavior.

10. RL can be used in robotics and games.

8.18 Chapter Summary

In this chapter, learners studied Reinforcement Learning, agents, environments, states, actions, rewards, penalties, policies, Q-values and Q-learning. Learners also explored robot learning through an interactive grid simulator and Python examples.

Remember: Reinforcement Learning is about learning from consequences. The agent improves by repeatedly acting, receiving feedback and updating its strategy.