Chapter 8: Reinforcement Learning
Learn how an agent improves through rewards and penalties. Explore robot learning, policies, Q-learning, exploration, exploitation and interactive reward simulation.
8.1 Chapter Overview
Reinforcement Learning, or RL, is a Machine Learning approach where a model learns by interacting with an environment. The learner, called an agent, takes actions and receives rewards or penalties. Over time, the agent learns which actions lead to better long-term results.
A common example is a robot learning how to move correctly. If the robot moves closer to its target, it receives a positive reward. If it hits a wall or moves into danger, it receives a penalty. Through repeated practice, the robot improves its movement strategy.
8.2 What is Reinforcement Learning?
Reinforcement Learning is inspired by trial-and-error learning. Instead of learning from labelled answers, the agent learns from feedback. The feedback may be positive, negative or neutral.
| Term | Meaning | Example |
|---|---|---|
| Agent | The learner or decision maker. | Robot, game player, trading bot |
| Environment | The world where the agent acts. | Room, game board, road, factory floor |
| State | The current situation of the agent. | Robot location on a grid |
| Action | Choice made by the agent. | Move up, down, left or right |
| Reward | Numeric feedback for an action; positive for a good action. | +10 for reaching goal |
| Penalty | A negative reward given for a bad action. | -10 for hitting danger |
| Policy | Strategy used to choose actions. | Move toward highest reward |
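The terms in the table fit together in a single interaction loop. The sketch below is only an illustration: the one-dimensional world, the names `environment_step`, `random_policy` and `run_episode`, and the reward values are assumptions chosen for this example, not part of any library.

```python
import random

GOAL = 4  # assumed goal position on a line of cells 0..4


def environment_step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    move = 1 if action == "right" else -1
    next_state = min(GOAL, max(0, state + move))  # stay inside the line
    reward = 10 if next_state == GOAL else -1
    return next_state, reward


def random_policy(state):
    """Policy: a placeholder strategy that ignores the state."""
    return random.choice(["left", "right"])


def run_episode(policy, max_steps=50):
    """Run one episode and return the total reward collected."""
    state = 0
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                           # agent acts
        state, reward = environment_step(state, action)  # environment responds
        total_reward += reward
        if state == GOAL:
            break
    return total_reward


print("Total reward with a random policy:", run_episode(random_policy))
```

Swapping `random_policy` for a better strategy is what learning amounts to: the loop stays the same, only the policy changes.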
8.3 Supervised vs Unsupervised vs Reinforcement Learning
| Learning Type | How It Learns | Example |
|---|---|---|
| Supervised Learning | Learns from labelled examples. | Predict pass/fail from past labelled records |
| Unsupervised Learning | Finds hidden patterns without labels. | Group customers into segments |
| Reinforcement Learning | Learns from rewards and penalties. | Robot learns to reach a target |
8.4 Robot Learning Example
Imagine a robot inside a 5 × 5 grid. The robot starts at one corner and must reach the goal. Some cells may contain danger. The robot can move up, down, left or right.
| Robot Action | Feedback | Learning Meaning |
|---|---|---|
| Moves closer to goal | Small positive reward | Good direction |
| Reaches goal | Large positive reward | Best outcome |
| Hits wall | Penalty | Invalid move |
| Steps into danger | Large penalty | Avoid this action |
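As a rough sketch of the movement rules, the function below moves a robot on a 5 × 5 grid and reports when a wall blocks the move. The coordinate convention (row 0 at the top) and the names `MOVES` and `move` are assumptions made for illustration.

```python
GRID_SIZE = 5

# Row/column change for each action; row 0 is assumed to be the top row.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def move(position, action):
    """Return (new_position, hit_wall) for one action on the grid."""
    row, col = position
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE
    if not inside:
        return position, True  # blocked by a wall: the robot stays put
    return (new_row, new_col), False


print(move((0, 0), "right"))  # valid move
print(move((0, 0), "up"))     # blocked by the top wall
```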
8.5 Reward Function
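The reward function defines how the environment gives feedback. Reward design matters because the agent optimizes exactly what is rewarded: if rewards are poorly designed, the agent may learn the wrong behavior. For example, a reward of +1 for every move would let the robot earn more by wandering than by reaching the goal.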
| Situation | Reward Value | Reason |
|---|---|---|
| Reach goal | +10 | Desired final outcome |
| Move to normal cell | -1 | Encourages shorter path |
| Hit wall | -5 | Discourages invalid movement |
| Enter danger cell | -10 | Strongly discourages unsafe action |
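The reward table can be written as a small function. This is a minimal sketch: the reward values come from the table, while the goal cell, the danger cells and the name `reward_for` are assumptions for illustration.

```python
GOAL = (4, 4)              # assumed goal cell
DANGER = {(1, 2), (3, 1)}  # assumed danger cells


def reward_for(cell, hit_wall=False):
    """Return the reward for arriving at `cell`, using the table's values."""
    if hit_wall:
        return -5
    if cell == GOAL:
        return 10
    if cell in DANGER:
        return -10
    return -1  # normal cell: a small cost that favors shorter paths


# Total reward along a short hypothetical path
path = [(0, 1), (1, 1), (1, 2)]
print(sum(reward_for(cell) for cell in path))
```

Because every normal step costs 1, a longer route to the goal always ends with a lower total than a shorter one, which is why the table describes -1 as encouraging a shorter path.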
8.6 Exploration vs Exploitation
Reinforcement Learning must balance two behaviors:
Exploration
The agent tries new actions to discover better rewards. Example: robot tries a new path.
Exploitation
The agent uses the best-known action. Example: robot follows the path that previously worked.
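One common way to balance the two behaviors is an epsilon-greedy rule: with a small probability the agent explores, otherwise it exploits. The chapter does not name this rule, so treat the sketch below, including the function name `choose_action` and the parameter `epsilon`, as an assumed illustration.

```python
import random


def choose_action(q_values, epsilon):
    """Pick an action index from a list of estimated action values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: any action
    return q_values.index(max(q_values))        # exploit: best-known action


estimates = [0.0, 2.5, 1.0]  # hypothetical values for three actions
print(choose_action(estimates, epsilon=0.0))  # always exploits
print(choose_action(estimates, epsilon=0.2))  # explores about 20% of the time
```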
8.7 Policy, Value and Q-Value
| Concept | Meaning | Example |
|---|---|---|
| Policy | Strategy for choosing actions. | If at cell A, move right |
| Value | How good a state is. | Cells near goal have higher value |
| Q-Value | How good an action is in a state. | Move right from this cell has value 8 |
The agent learns Q-values to decide which action is best in each state.
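The three concepts can be related in a few lines. In this sketch the Q-values are made-up numbers, and the state value is taken as the best Q-value in that state, which holds when the agent always acts greedily; the names `greedy_action` and `state_value` are assumptions.

```python
ACTIONS = ["left", "right"]

# Hypothetical Q-values for three cells; each row is [left, right].
q_table = {
    "A": [1.0, 8.0],
    "B": [2.0, 5.0],
    "C": [6.0, 3.0],
}


def greedy_action(state):
    """Policy: choose the action with the highest Q-value in this state."""
    values = q_table[state]
    return ACTIONS[values.index(max(values))]


def state_value(state):
    """Value of a state under the greedy policy: its best Q-value."""
    return max(q_table[state])


policy = {state: greedy_action(state) for state in q_table}
print(policy)
print(state_value("A"))
```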
8.8 Q-Learning
Q-Learning is a popular reinforcement learning algorithm. It learns the value of actions in different states. These values are stored in a Q-table.
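The update rule itself is not shown in this section. In its standard form, which matches the arithmetic in the code of Section 8.12 and uses the symbols defined in the table below, it can be written as:

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```

The bracketed term is the gap between the new estimate, r + γ max Q(s',a'), and the current Q-value; the learning rate α controls how much of that gap is closed in one step.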
| Symbol | Meaning |
|---|---|
| Q(s,a) | Current Q-value for state s and action a |
| α alpha | Learning rate |
| r | Reward received |
| γ gamma | Discount factor for future rewards |
| s' | New state after action |
| max Q(s',a') | Highest Q-value over all actions a' in the next state s' (the estimated best future return) |
8.9 Interactive Robot Reward Simulator
Use the buttons to move the robot. The robot receives rewards or penalties depending on the action. This simple simulator helps learners understand state, action, reward and penalty.
Robot Grid World
Goal: Reach 🏁. Avoid ⚠️. Normal movement gives -1 reward, danger gives -10, goal gives +10.
Interactive Reward Graph
Each bar represents the reward received after an action.
8.10 Python Example: Simple Grid Reward Logic
```python
# Simple reinforcement learning style reward system
robot_position = 0
goal_position = 4
danger_position = 2
total_reward = 0
actions = ["right", "right", "right", "right"]

for action in actions:
    if action == "right":
        robot_position += 1

    if robot_position == goal_position:
        reward = 10
        print("Robot reached the goal.")
    elif robot_position == danger_position:
        reward = -10
        print("Robot entered danger.")
    else:
        reward = -1
        print("Robot moved to position:", robot_position)

    total_reward += reward
    print("Reward:", reward)
    print("Total Reward:", total_reward)
    print("-----")
```
Output (condensed to one line per step):
```
Robot moved to position: 1 | Reward: -1
Robot entered danger | Reward: -10
Robot moved to position: 3 | Reward: -1
Robot reached the goal | Reward: 10
```
8.11 Python Example: Q-Table Structure
A Q-table stores values for each state-action pair. Initially, values can start at zero.
```python
import numpy as np

states = 5
actions = 2  # 0 = left, 1 = right

q_table = np.zeros((states, actions))
print(q_table)
```
A table of zeros with 5 rows and 2 columns.
Line-by-Line Explanation
| Code | Explanation |
|---|---|
| states = 5 | There are 5 possible positions. |
| actions = 2 | The robot can move left or right. |
| np.zeros((states, actions)) | Creates a 5 × 2 Q-table with every value initialized to zero. |
8.12 Python Example: Basic Q-Learning Update
```python
old_q_value = 0
learning_rate = 0.1
reward = 10
discount_factor = 0.9
best_future_q = 5

new_q_value = old_q_value + learning_rate * (
    reward + discount_factor * best_future_q - old_q_value
)

print("Updated Q-Value:", new_q_value)
```
Output:
```
Updated Q-Value: 1.45
```
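A single update only moves the Q-value part of the way toward its target, reward + discount_factor × best_future_q = 10 + 0.9 × 5 = 14.5. The sketch below wraps the same arithmetic in a hypothetical helper, `q_update`, and applies it three times with the same inputs to show the value creeping toward 14.5.

```python
def q_update(old_q, reward, best_future_q, learning_rate=0.1, discount_factor=0.9):
    """Apply one Q-learning update and return the new Q-value."""
    target = reward + discount_factor * best_future_q
    return old_q + learning_rate * (target - old_q)


q_value = 0.0
for step in range(1, 4):
    q_value = q_update(q_value, reward=10, best_future_q=5)
    print(f"After update {step}: {q_value:.4f}")
```

With a learning rate of 0.1, each update closes 10% of the remaining gap, so the values printed are 1.4500, 2.7550 and 3.9295.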
8.13 Mini Q-Learning Example: One-Dimensional Robot
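This example trains a simple robot to move toward a goal along a one-dimensional line. Actions are chosen at random during training, so the robot explores constantly; the Q-learning update still learns which action is better in each state.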
```python
import numpy as np
import random

states = 5
actions = 2  # 0 = left, 1 = right
q_table = np.zeros((states, actions))

learning_rate = 0.1
discount_factor = 0.9
episodes = 100
goal_state = 4

for episode in range(episodes):
    state = 0  # every episode starts at the left end of the line

    while state != goal_state:
        # Choose an action uniformly at random (pure exploration)
        action = random.choice([0, 1])

        # Move one cell, staying inside the line
        if action == 0:
            next_state = max(0, state - 1)
        else:
            next_state = min(goal_state, state + 1)

        # Reward: +10 for reaching the goal, -1 for any other step
        if next_state == goal_state:
            reward = 10
        else:
            reward = -1

        # Q-learning update for the state-action pair just tried
        best_future_q = np.max(q_table[next_state])
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_factor * best_future_q - q_table[state, action]
        )

        state = next_state

print("Trained Q-Table:")
print(q_table)
```
8.14 Applications of Reinforcement Learning
| Application Area | RL Use |
|---|---|
| Robotics | Robot movement, grasping objects, navigation |
| Games | AI players learning strategies |
| Self-Driving Vehicles | Decision making in traffic environments |
| Manufacturing | Optimizing machine control and production flow |
| Finance | Trading strategy optimization |
| Healthcare | Treatment recommendation strategies |
8.15 Common Beginner Mistakes
| Mistake | Problem | Correction |
|---|---|---|
| Reward is poorly designed | Agent learns wrong behavior | Design rewards carefully |
| No exploration | Agent may miss better actions | Allow exploration during training |
| Too much exploration | Agent behaves randomly | Reduce exploration over time |
| Training too few episodes | Agent may not learn enough | Train over many episodes |
| Ignoring environment rules | Invalid actions may happen | Define boundaries and penalties |
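The correction "reduce exploration over time" is often implemented by shrinking an exploration probability as training progresses. The schedule below is one possible sketch; the name `epsilon_for` and the values of `start`, `end` and `decay` are assumptions, not recommendations from this chapter.

```python
def epsilon_for(episode, start=1.0, end=0.05, decay=0.95):
    """Exploration probability for a given episode (exponential decay)."""
    return max(end, start * decay ** episode)


for episode in (0, 10, 50, 100):
    print(episode, round(epsilon_for(episode), 3))
```

Early episodes explore almost always, while later ones mostly exploit; the floor `end` keeps a little exploration so the agent can still notice better actions.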
8.16 Hands-On Activities
Activity 1: Reward Table
Create a reward table for a robot moving in a classroom from door to charging station.
Activity 2: Grid World
Draw a 5 × 5 grid with start, goal and danger cells. Define rewards for each movement.
Activity 3: Q-Table
Create a Q-table with 6 states and 4 actions using NumPy.
Activity 4: Exploration vs Exploitation
Explain why a robot should sometimes try new paths instead of always repeating the same known path.
Mini Project: Robot Path Learner
Create a Python program where a robot learns to move from start to goal using rewards and penalties.
8.17 Interactive Final Assessment Quiz
Each correct answer gives +1 mark. Each wrong answer gives -0.5 mark.
1. Reinforcement Learning learns through rewards and penalties.
2. In RL, the learner is called:
3. The environment gives feedback to the agent.
4. A policy is:
5. Q-Learning uses a Q-table to store state-action values.
6. Exploration means trying new actions.
7. Exploitation means using the best-known action.
8. A robot reaching the goal should normally receive:
9. Poor reward design can cause wrong learning behavior.
10. RL can be used in robotics and games.
8.18 Chapter Summary
In this chapter, learners studied Reinforcement Learning, agents, environments, states, actions, rewards, penalties, policies, Q-values and Q-learning. Learners also explored robot learning through an interactive grid simulator and Python examples.