Chapter 8 - Reinforcement Learning

8.1 Chapter Overview

Reinforcement Learning, or RL, is a Machine Learning approach where a model learns by interacting with an environment. The learner, called an agent, takes actions and receives rewards or penalties. Over time, the agent learns which actions lead to better long-term results.

A common example is a robot learning how to move correctly. If the robot moves closer to its target, it receives a positive reward. If it hits a wall or moves into danger, it receives a penalty. Through repeated practice, the robot improves its movement strategy.

Learning Outcome: By the end of this chapter, learners should be able to explain reinforcement learning, identify key RL components, understand rewards and penalties, describe robot learning, and implement simple Q-learning logic using Python.

1. Agent Observes State
2. Agent Takes Action
3. Environment Responds
4. Reward or Penalty
5. Policy Improves
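The five steps above can be sketched as a simple loop. This is only an illustration of the cycle, not a full RL algorithm: the `LineEnvironment` class, its `reset` and `step` methods, and the `run_episode` function are hypothetical names introduced here, the agent picks actions at random, and the policy-improvement step is left as a comment.

```python
import random


class LineEnvironment:
    """Hypothetical 1-D world: positions 0..4, with the goal at position 4."""

    def __init__(self):
        self.goal = 4
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the ends of the line act as walls
        self.state = min(self.goal, max(0, self.state + action))
        done = self.state == self.goal
        reward = 10 if done else -1
        return self.state, reward, done


def run_episode(env, rng, max_steps=1000):
    state = env.reset()                          # 1. agent observes state
    total_reward = 0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])             # 2. agent takes action (random here)
        state, reward, done = env.step(action)   # 3. environment responds
        total_reward += reward                   # 4. reward or penalty
        # 5. a learning agent would update its policy here
        if done:
            break
    return state, total_reward


final_state, total = run_episode(LineEnvironment(), random.Random(0))
print("Final state:", final_state, "Total reward:", total)
```

Because this agent never learns, its total reward depends only on luck. The rest of the chapter replaces the random choice with a strategy that improves from feedback.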

8.2 What is Reinforcement Learning?

Reinforcement Learning is inspired by trial-and-error learning. Instead of learning from labelled answers, the agent learns from feedback. The feedback may be positive, negative or neutral.

Term	Meaning	Example
Agent	The learner or decision maker.	Robot, game player, trading bot
Environment	The world where the agent acts.	Room, game board, road, factory floor
State	The current situation of the agent.	Robot location on a grid
Action	Choice made by the agent.	Move up, down, left or right
Reward	Positive feedback for good action.	+10 for reaching goal
Penalty	Negative feedback for bad action.	-10 for hitting danger
Policy	Strategy used to choose actions.	Move toward highest reward

8.3 Supervised vs Unsupervised vs Reinforcement Learning

Learning Type	How It Learns	Example
Supervised Learning	Learns from labelled examples.	Predict pass/fail from past labelled records
Unsupervised Learning	Finds hidden patterns without labels.	Group customers into segments
Reinforcement Learning	Learns from rewards and penalties.	Robot learns to reach a target

Simple Difference: Supervised learning learns from answer keys, unsupervised learning finds patterns, and reinforcement learning learns from consequences.

8.4 Robot Learning Example

Imagine a robot inside a 5 × 5 grid. The robot starts at one corner and must reach the goal. Some cells may contain danger. The robot can move up, down, left or right.

Robot Action	Feedback	Learning Meaning
Moves closer to goal	Small positive reward	Good direction
Reaches goal	Large positive reward	Best outcome
Hits wall	Penalty	Invalid move
Steps into danger	Large penalty	Avoid this action

Learning Goal: Maximize total future reward
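The grid world described above can be sketched in a few lines of Python. The text does not fix the coordinates, so the start corner, goal corner and danger cells below are assumptions chosen for illustration, as are the names `move` and `describe`.

```python
GRID_SIZE = 5
START = (0, 0)               # assumed: top-left corner (row, column)
GOAL = (4, 4)                # assumed: opposite corner
DANGER = {(1, 3), (3, 1)}    # assumed danger cells

# Each action shifts (row, column) by one cell.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def move(position, action):
    """Return (new_position, hit_wall). A move off the grid leaves the robot in place."""
    row, col = position
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE:
        return (new_row, new_col), False
    return position, True


def describe(position):
    """Classify a cell as goal, danger or normal."""
    if position == GOAL:
        return "goal"
    if position in DANGER:
        return "danger"
    return "normal"


position = START
for action in ["up", "right", "down"]:
    position, hit_wall = move(position, action)
    print(action, "->", position, "wall" if hit_wall else describe(position))
```

The first move ("up" from the top row) hits a wall and leaves the robot where it was, which is the "invalid move" case in the table.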

8.5 Reward Function

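The reward function defines how the environment gives feedback. Reward design matters because the agent optimizes exactly what is rewarded, not what the designer intended. If rewards are poorly designed, the agent may learn the wrong behavior. For example, if every step earned a positive reward, the robot could collect more reward by wandering than by reaching the goal.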

Situation	Reward Value	Reason
Reach goal	+10	Desired final outcome
Move to normal cell	-1	Encourages shorter path
Hit wall	-5	Discourages invalid movement
Enter danger cell	-10	Strongly discourages unsafe action

Important: Rewards should guide the agent toward useful behavior, not simply make it move randomly.
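The reward table above translates directly into a lookup. This is a minimal sketch; the function name `get_reward`, the outcome labels and the two example episodes are assumptions made for illustration.

```python
# Reward values taken from the table above.
REWARDS = {
    "goal": 10,
    "normal": -1,
    "wall": -5,
    "danger": -10,
}


def get_reward(outcome):
    """Return the reward for one outcome label."""
    return REWARDS[outcome]


# Two hypothetical episodes that both end at the goal.
long_path = ["normal", "wall", "normal", "normal", "goal"]
short_path = ["normal", "normal", "goal"]

long_total = sum(get_reward(outcome) for outcome in long_path)
short_total = sum(get_reward(outcome) for outcome in short_path)

print("Long path total:", long_total)    # -1 - 5 - 1 - 1 + 10 = 2
print("Short path total:", short_total)  # -1 - 1 + 10 = 8
```

The shorter, wall-free path earns the higher total, which is how the -1 step cost encourages short paths.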

8.6 Exploration vs Exploitation

Reinforcement Learning must balance two behaviors:

Exploration

The agent tries new actions to discover better rewards. Example: robot tries a new path.

Exploitation

The agent uses the best-known action. Example: robot follows the path that previously worked.

Good RL = Balance Exploration and Exploitation
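One common way to strike this balance is an epsilon-greedy rule: with a small probability (epsilon) the agent picks a random action, and otherwise it picks the action with the highest current estimate. The chapter does not prescribe a specific method, so treat this as one illustrative option; the function name `choose_action`, the value of `epsilon` and the example estimates are assumptions.

```python
import random


def choose_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit


rng = random.Random(42)
q_values = [1.0, 5.0, 2.0, 0.5]   # assumed estimates for up, down, left, right

choices = [choose_action(q_values, epsilon=0.2, rng=rng) for _ in range(1000)]
share_best = choices.count(1) / len(choices)
print("Share of picks for the best-known action:", share_best)
```

With `epsilon=0.2`, the agent should exploit the best-known action most of the time while still sampling the others occasionally. Setting `epsilon=0` gives pure exploitation and `epsilon=1` gives pure exploration.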

8.7 Policy, Value and Q-Value

Concept	Meaning	Example
Policy	Strategy for choosing actions.	If at cell A, move right
Value	How good a state is.	Cells near goal have higher value
Q-Value	How good an action is in a state.	Move right from this cell has value 8

The agent learns Q-values to decide which action is best in each state.
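The three concepts are connected: once Q-values are known, the value of a state can be read as its best Q-value, and a greedy policy picks the action that achieves it. The sketch below uses invented numbers (the Q-table entries are illustrative, not learned) to show that relationship.

```python
import numpy as np

action_names = ["left", "right"]

# Rows = states (cells 0..2), columns = actions (left, right).
# These values are made up for illustration.
q_table = np.array([
    [2.0, 6.2],
    [4.6, 8.0],
    [6.2, 10.0],
])

state_values = q_table.max(axis=1)        # Value: best Q-value in each state
greedy_actions = q_table.argmax(axis=1)   # Policy: action with the best Q-value

for state in range(len(q_table)):
    print(
        f"State {state}: value = {state_values[state]}, "
        f"policy = move {action_names[greedy_actions[state]]}"
    )
```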

8.8 Q-Learning

Q-Learning is a popular reinforcement learning algorithm. It learns the value of actions in different states. These values are stored in a Q-table.

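Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]

Symbol	Meaning
Q(s,a)	Current Q-value for state s and action a
α (alpha)	Learning rate: how strongly each update changes the stored value
r	Reward received
γ (gamma)	Discount factor for future rewards
s'	New state after action
a'	Any action available in the new state
max_a' Q(s',a')	Highest Q-value over all actions in the next state (estimate of the best future reward)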

Simple Meaning: Q-Learning updates the agent's memory by combining current reward and expected future reward.

8.9 Interactive Robot Reward Simulator

Use the buttons to move the robot. The robot receives rewards or penalties depending on the action. This simple simulator helps learners understand state, action, reward and penalty.

Robot Grid World

Goal: Reach 🏁. Avoid ⚠️. Normal movement gives -1 reward, danger gives -10, goal gives +10.

Interactive Reward Graph

Each bar represents the reward received after an action.

8.10 Python Example: Simple Grid Reward Logic

# Simple reinforcement learning style reward system

robot_position = 0
goal_position = 4
danger_position = 2
total_reward = 0

actions = ["right", "right", "right", "right"]

for action in actions:
    if action == "right":
        robot_position += 1

    if robot_position == goal_position:
        reward = 10
        print("Robot reached the goal.")
    elif robot_position == danger_position:
        reward = -10
        print("Robot entered danger.")
    else:
        reward = -1
        print("Robot moved to position:", robot_position)

    total_reward += reward
    print("Reward:", reward)
    print("Total Reward:", total_reward)
    print("-----")

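Expected Output (condensed; the program prints each value on its own line, followed by -----):
Robot moved to position: 1 | Reward: -1 | Total Reward: -1
Robot entered danger. | Reward: -10 | Total Reward: -11
Robot moved to position: 3 | Reward: -1 | Total Reward: -12
Robot reached the goal. | Reward: 10 | Total Reward: -2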

8.11 Python Example: Q-Table Structure

A Q-table stores values for each state-action pair. Initially, values can start at zero.

import numpy as np

states = 5
actions = 2   # 0 = left, 1 = right

q_table = np.zeros((states, actions))

print(q_table)

Expected Output:
A table of zeros with 5 rows and 2 columns.

Line-by-Line Explanation

Code	Explanation
states = 5	There are 5 possible positions.
actions = 2	The robot can move left or right.
np.zeros((states, actions))	Creates a Q-table with 5 rows and 2 columns, with every entry initialised to zero.

8.12 Python Example: Basic Q-Learning Update

old_q_value = 0
learning_rate = 0.1
reward = 10
discount_factor = 0.9
best_future_q = 5

new_q_value = old_q_value + learning_rate * (
    reward + discount_factor * best_future_q - old_q_value
)

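# Round for display: binary floating-point arithmetic can leave a tiny error in the last digits
print("Updated Q-Value:", round(new_q_value, 2))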

Expected Output:
Updated Q-Value: 1.45

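Explanation: The update is 0 + 0.1 × (10 + 0.9 × 5 - 0) = 0.1 × 14.5 = 1.45. The Q-value increases because the action received a good reward and the next state has good future potential. Only 10% of the difference is applied because the learning rate is 0.1.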

8.13 Mini Q-Learning Example: One-Dimensional Robot

This example trains a simple robot to move toward a goal in a one-dimensional line.

import numpy as np
import random

states = 5
actions = 2  # 0 = left, 1 = right

q_table = np.zeros((states, actions))

learning_rate = 0.1
discount_factor = 0.9
episodes = 100

goal_state = 4

for episode in range(episodes):
    state = 0

    while state != goal_state:
        action = random.choice([0, 1])

        if action == 0:
            next_state = max(0, state - 1)
        else:
            next_state = min(goal_state, state + 1)

        if next_state == goal_state:
            reward = 10
        else:
            reward = -1

        best_future_q = np.max(q_table[next_state])

        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_factor * best_future_q - q_table[state, action]
        )

        state = next_state

print("Trained Q-Table:")
print(q_table)

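Learning Note: In this example the robot picks every action at random (pure exploration). The Q-table still learns which actions are best, because each update uses the best estimated value of the next state rather than the action the robot happens to take next. After many episodes, the "right" column typically holds higher values than the "left" column in every non-goal state, showing that moving right eventually leads to the goal and higher reward. Exact values vary from run to run because no random seed is set.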
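To judge whether the trained table looks reasonable, it helps to know what the "move right" values should approach. In the training loop above the goal state is never updated, so its Q-values stay at zero, and the ideal values can be worked out backwards from the goal using the same rewards and discount factor. The sketch below does that calculation; the variable names are illustrative. With only 100 episodes and a learning rate of 0.1, the trained values may still sit some distance from these targets.

```python
discount_factor = 0.9
goal_state = 4

ideal_right = {}
best_value_of_next_state = 0.0   # nothing follows the goal state

# Work backwards from the state next to the goal down to state 0.
for state in range(goal_state - 1, -1, -1):
    reward = 10 if state + 1 == goal_state else -1
    ideal_right[state] = reward + discount_factor * best_value_of_next_state
    best_value_of_next_state = ideal_right[state]

for state in sorted(ideal_right):
    print(f"State {state}, move right: {ideal_right[state]:.2f}")
```

The targets fall from 10.00 next to the goal to 4.58 at the start, which shows how the discount factor and the -1 step cost make distant rewards worth less.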

8.14 Applications of Reinforcement Learning

Application Area	RL Use
Robotics	Robot movement, grasping objects, navigation
Games	AI players learning strategies
Self-Driving Vehicles	Decision making in traffic environments
Manufacturing	Optimizing machine control and production flow
Finance	Trading strategy optimization
Healthcare	Treatment recommendation strategies

8.15 Common Beginner Mistakes

Mistake	Problem	Correction
Reward is poorly designed	Agent learns wrong behavior	Design rewards carefully
No exploration	Agent may miss better actions	Allow exploration during training
Too much exploration	Agent behaves randomly	Reduce exploration over time
Training too few episodes	Agent may not learn enough	Train over many episodes
Ignoring environment rules	Invalid actions may happen	Define boundaries and penalties
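The correction "reduce exploration over time" is often implemented by shrinking an exploration rate after each episode. The sketch below assumes an exploration rate called `epsilon` with multiplicative decay; the starting value, decay factor and floor are illustrative choices, not values from this chapter.

```python
epsilon = 1.0        # assumed: start fully exploratory
epsilon_min = 0.05   # assumed: never stop exploring entirely
decay = 0.95         # assumed: multiply epsilon by this after each episode

history = []
for episode in range(100):
    history.append(epsilon)
    # ... one training episode would run here, using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay)

print("Episode 0:", round(history[0], 3))
print("Episode 50:", round(history[50], 3))
print("Episode 99:", round(history[99], 3))
```

Early episodes explore almost every step, while later episodes mostly exploit what has been learned. The floor `epsilon_min` keeps a small amount of exploration so the agent can still discover better actions.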

8.16 Hands-On Activities

Activity 1: Reward Table

Create a reward table for a robot moving in a classroom from door to charging station.

Activity 2: Grid World

Draw a 5 × 5 grid with start, goal and danger cells. Define rewards for each movement.

Activity 3: Q-Table

Create a Q-table with 6 states and 4 actions using NumPy.

Activity 4: Exploration vs Exploitation

Explain why a robot should sometimes try new paths instead of always repeating the same known path.

Mini Project: Robot Path Learner

Create a Python program where a robot learns to move from start to goal using rewards and penalties.

8.17 Interactive Final Assessment Quiz

Each correct answer gives +1 mark. Each wrong answer gives -0.5 mark.

1. Reinforcement Learning learns through rewards and penalties.

True False

2. In RL, the learner is called:

a) Agent
b) Spreadsheet
c) Browser
d) Printer

3. The environment gives feedback to the agent.

True False

4. A policy is:

a) Strategy for choosing actions
b) A chart color
c) A missing value
d) A file extension

5. Q-Learning uses a Q-table to store state-action values.

True False

6. Exploration means trying new actions.

True False

7. Exploitation means using the best-known action.

True False

8. A robot reaching the goal should normally receive:

a) Positive reward
b) Negative penalty only
c) No feedback ever
d) Deleted data

9. Poor reward design can cause wrong learning behavior.

True False

10. RL can be used in robotics and games.

True False

8.18 Chapter Summary

In this chapter, learners studied Reinforcement Learning, agents, environments, states, actions, rewards, penalties, policies, Q-values and Q-learning. Learners also explored robot learning through an interactive grid simulator and Python examples.

Remember: Reinforcement Learning is about learning from consequences. The agent improves by repeatedly acting, receiving feedback and updating its strategy.