Chapter 6 - Supervised Learning Fundamentals

6.1 Chapter Overview

Supervised Learning is a Machine Learning approach where a model learns from labelled data. Labelled data contains both input features and the correct output. The model studies the relationship between input and output, then predicts outcomes for new unseen data.

Learning Outcome: By the end of this chapter, learners should be able to explain regression and classification, build predictive models using Python, evaluate performance, and interpret model results.

1. Collect Labelled Data
2. Split Dataset
3. Train Model
4. Predict
5. Evaluate

6.2 What is Supervised Learning?

Supervised Learning means the model is trained using examples where the correct answer is already known. The learning process is similar to a student learning from questions with answer keys.

Component	Meaning	Example
Features X	Input values used for prediction	Study hours, attendance, assignments
Target y	Correct output or label	Exam score or Pass/Fail
Training Data	Data used to teach the model	Past student records
Testing Data	Data used to evaluate model performance	New unseen records

6.3 Regression vs Classification

Type	Purpose	Output	Example
Regression	Predict continuous numeric values	Number	Predict exam marks, salary, sales
Classification	Predict categories or labels	Class	Pass/Fail, Spam/Not Spam, Fraud/Normal

Regression Question

How many marks will the student score?

Classification Question

Will the student pass or fail?
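The same records can support either question; what changes is the target. A minimal sketch, assuming hypothetical marks and an assumed pass mark of 50 (neither comes from a real dataset):

```python
import numpy as np

# Hypothetical exam marks for five students (illustrative values only).
marks = np.array([35, 48, 50, 72, 90])

# Assumed pass mark, chosen only for this illustration.
PASS_MARK = 50

# Regression target: the numeric mark itself.
y_regression = marks

# Classification target: 1 = Pass, 0 = Fail, derived from the same marks.
y_classification = (marks >= PASS_MARK).astype(int)

print("Regression target:", y_regression)
print("Classification target:", y_classification)
```

A regression model would be trained on the first target and a classification model on the second.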

6.4 Linear Regression

Linear Regression is used when the target value is numeric. It tries to find a straight-line relationship between input and output.

y = mx + b

Symbol	Meaning
y	Predicted output
x	Input feature
m	Slope of the line
b	Intercept

Python Working Example: Predict Marks from Study Hours

from sklearn.linear_model import LinearRegression
import numpy as np

# X stores input features.
# Each value is placed inside another list because Scikit-learn expects 2D input.
X = np.array([[1], [2], [3], [4], [5]])

# y stores target values.
# These are the actual marks for each study hour value.
y = np.array([40, 50, 60, 70, 80])

# Create a Linear Regression model object.
model = LinearRegression()

# Train the model using input X and output y.
model.fit(X, y)

# Predict marks for a student who studies 6 hours.
prediction = model.predict([[6]])

print("Predicted Marks:", prediction[0])

Expected Output (the last decimal places may differ slightly because of floating-point rounding):
Predicted Marks: 90.0

Line-by-Line Explanation

Code	Explanation
from sklearn.linear_model import LinearRegression	Imports the Linear Regression algorithm.
import numpy as np	Imports NumPy to create numerical arrays.
X = np.array([[1], [2], [3], [4], [5]])	Creates the input feature dataset containing study hours.
y = np.array([40, 50, 60, 70, 80])	Creates the target values containing marks.
model = LinearRegression()	Creates the model.
model.fit(X, y)	Trains the model to learn the relationship between hours and marks.
model.predict([[6]])	Predicts marks for 6 study hours.
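The slope m and intercept b from y = mx + b can be read back from the trained model. The sketch below does that and, as a cross-check, recomputes both with the standard closed-form least-squares formulas for a single feature. The variable names m_manual and b_manual are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same study-hours data as the example above.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([40, 50, 60, 70, 80])

model = LinearRegression().fit(X, y)

# Slope (m) and intercept (b) learned by the model.
m_model = model.coef_[0]
b_model = model.intercept_

# Closed-form least-squares estimates for one feature:
# m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# b = mean_y - m * mean_x
x = X.ravel()
m_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_manual = y.mean() - m_manual * x.mean()

print("Model slope and intercept:", round(m_model, 2), round(b_model, 2))
print("Manual slope and intercept:", round(m_manual, 2), round(b_manual, 2))
print("Manual prediction for 6 hours:", m_manual * 6 + b_manual)
```

For this data the line is y = 10x + 30, so 6 study hours gives 10 × 6 + 30 = 90 marks.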

[Figure: Expected regression graph, with the prediction for 6 study hours marked on the fitted line]

Graph Meaning: As study hours increase, predicted marks also increase.

6.5 Regression Visualization with Matplotlib

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([40, 50, 60, 70, 80])

model = LinearRegression()
model.fit(X, y)

plt.scatter(X, y, label="Actual Data")
plt.plot(X, model.predict(X), label="Regression Line")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")
plt.legend()
plt.show()

Explanation: scatter() shows real data points. plot() draws the learned regression line.

6.6 Train-Test Split

Train-test split divides data into two parts. The model learns from training data and is evaluated using testing data.

Training Data = Model Learning | Testing Data = Model Checking

from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([35, 45, 50, 60, 65, 75, 85, 95])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)

Parameter	Explanation
test_size=0.25	25% of the data is held out for testing; the remaining 75% is used for training.
random_state=42	Makes the split repeatable.
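The effect of both parameters can be checked directly. This is a small sketch that reuses the eight records above; the names split_a and split_b are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([35, 45, 50, 60, 65, 75, 85, 95])

# Two splits with the same random_state select the same rows.
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

X_train, X_test, y_train, y_test = split_a

print("Training rows:", len(X_train))
print("Testing rows:", len(X_test))
print("Same test rows both times:", np.array_equal(X_test, split_b[1]))
```

With 8 records and test_size=0.25, 2 records go to the test set and 6 remain for training.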

6.7 Regression Evaluation Metrics

Regression models are evaluated by comparing actual values with predicted values.

MAE = (1/n) Σ |y - ŷ|

MSE = (1/n) Σ (y - ŷ)²

RMSE = √MSE

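R² = 1 - Σ (y - ŷ)² / Σ (y - ȳ)²

Here y is the actual value, ŷ is the predicted value, ȳ is the mean of the actual values and n is the number of records. R² measures the share of the variation in the actual values that the model explains; a value of 1.0 means a perfect fit.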

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

actual = [80, 70, 90, 60]
predicted = [78, 74, 88, 65]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(actual, predicted)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r2)

Expected Output:
MAE: 3.25
MSE: 12.25
RMSE: 3.5
R2 Score: 0.902
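To connect the formulas with the library calls, the same four metrics can be computed by hand with NumPy. This sketch uses the same actual and predicted values; the names errors, ss_res and ss_tot are illustrative.

```python
import numpy as np

actual = np.array([80, 70, 90, 60])
predicted = np.array([78, 74, 88, 65])

# Difference between actual and predicted value for each record.
errors = actual - predicted

mae = np.mean(np.abs(errors))   # mean of absolute errors
mse = np.mean(errors ** 2)      # mean of squared errors
rmse = np.sqrt(mse)             # square root of MSE

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", round(r2, 3))
```

The errors are 2, -4, 2 and -5, so MAE = 13 / 4 = 3.25 and MSE = 49 / 4 = 12.25, matching the scikit-learn results.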

6.8 Classification Fundamentals

Classification predicts categories. Instead of predicting marks, it predicts labels such as Pass or Fail.

Features	Target
Attendance, Assignment Marks	Pass / Fail
Email words	Spam / Not Spam
Transaction amount and behavior	Fraud / Normal

6.9 Logistic Regression for Classification

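Logistic Regression is used for classification problems. It predicts a probability and then assigns a class.

Output: a probability between 0 and 1. In a binary problem, scikit-learn's predict() assigns class 1 when the predicted probability of class 1 is above 0.5, and class 0 otherwise.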

from sklearn.linear_model import LogisticRegression
import numpy as np

# Features: attendance percentage
X = np.array([[90], [85], [45], [50], [95], [40]])

# Labels: 1 = Pass, 0 = Fail
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression()

model.fit(X, y)

prediction = model.predict([[88]])

if prediction[0] == 1:
    print("Prediction: Pass")
else:
    print("Prediction: Fail")

Expected Output:
Prediction: Pass

Code	Explanation
X = np.array([[90], ...])	Stores attendance values as model input.
y = np.array([1, 1, 0...])	Stores pass/fail labels.
model.fit(X, y)	Trains the classifier.
model.predict([[88]])	Predicts whether 88% attendance is Pass or Fail.
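The probability behind the Pass/Fail decision can be inspected with predict_proba(). The sketch below also recomputes it from the model's linear score with the sigmoid function 1 / (1 + e^(-z)). The max_iter=1000 setting is an assumption added here to give the solver room to converge on unscaled data; it is not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[90], [85], [45], [50], [95], [40]])
y = np.array([1, 1, 0, 0, 1, 0])

# max_iter is raised as a precaution (assumption, see lead-in).
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

new_student = np.array([[88]])

# Column 0 = probability of Fail (0), column 1 = probability of Pass (1).
probabilities = model.predict_proba(new_student)[0]

# The same Pass probability, from the sigmoid of the linear score z.
score = model.decision_function(new_student)[0]
sigmoid_probability = 1 / (1 + np.exp(-score))

print("P(Fail), P(Pass):", probabilities)
print("Sigmoid of linear score:", sigmoid_probability)
print("Predicted class:", model.predict(new_student)[0])
```

The two probabilities always sum to 1, and the predicted class is the one with the larger probability.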

6.10 Classification with Two Features

Most real classification problems use more than one feature. This example uses attendance and assignment marks.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Features: [attendance, assignment_marks]
X = np.array([
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
])

# Labels: 1 = Pass, 0 = Fail
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

new_student = [[88, 82]]

prediction = model.predict(new_student)

print("Prediction:", prediction[0])

Expected Output:
Prediction: 1
Meaning: The student is predicted to Pass.

6.11 Confusion Matrix

A confusion matrix compares actual labels with predicted labels.

	Predicted Pass	Predicted Fail
Actual Pass	True Positive	False Negative
Actual Fail	False Positive	True Negative

from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)

print(matrix)

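Expected Output:
[[2 1]
 [0 3]]

Note: scikit-learn sorts the labels in ascending order, so the first row is Actual Fail (0) and the second row is Actual Pass (1). The printed layout is therefore [[TN FP] [FN TP]], the reverse of the table above. Here TN = 2, FP = 1, FN = 0 and TP = 3.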
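The four counts can be unpacked from the matrix, and the labels parameter can reorder the rows and columns so that Pass comes first, as in the table at the start of this section. A short sketch; the name pass_first is illustrative.

```python
from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

# Default label order is ascending (0 = Fail, then 1 = Pass),
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)

# Listing Pass (1) first gives the layout:
# [[TP FN]
#  [FP TN]]
pass_first = confusion_matrix(actual, predicted, labels=[1, 0])
print(pass_first)
```

With Pass listed first, the matrix reads [[3 0] [1 2]]: three passes predicted correctly, no missed passes, one fail wrongly predicted as Pass, and two fails predicted correctly.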

6.12 Classification Evaluation Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(actual, predicted))
print("Precision:", precision_score(actual, predicted))
print("Recall:", recall_score(actual, predicted))
print("F1 Score:", f1_score(actual, predicted))
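The four formulas can also be worked through by hand for the same labels, treating 1 (Pass) as the positive class. A minimal sketch in plain Python; the helper name pairs is illustrative.

```python
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

# Count each outcome, with 1 (Pass) as the positive class.
pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1 Score:", round(f1, 3))
```

With TP = 3, TN = 2, FP = 1 and FN = 0, this gives Accuracy = 5/6 ≈ 0.833, Precision = 3/4 = 0.75, Recall = 3/3 = 1.0 and F1 ≈ 0.857.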

6.13 Popular Supervised Learning Algorithms

Algorithm	Type	Use
Linear Regression	Regression	Predict numeric values
Logistic Regression	Classification	Predict binary classes
Decision Tree	Both	Rule-based prediction
K-Nearest Neighbors	Both	Predict based on nearby examples
Support Vector Machine	Both	Find a maximum-margin separating boundary
Random Forest	Both	Combines many decision trees

6.14 Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier()
model.fit(X, y)

prediction = model.predict([[75, 70]])

print("Prediction:", prediction[0])

Explanation: A decision tree learns rules such as: if attendance is high and marks are high, predict Pass.
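The rules a tree has learned can be printed with scikit-learn's export_text(). In the sketch below, the feature names and the fixed random_state are assumptions added for readability and repeatability. Because the two features separate the classes equally well in this tiny dataset, the feature the tree chooses for its split may differ.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[90, 85], [80, 78], [45, 40], [55, 50], [95, 90], [35, 30]]
y = [1, 1, 0, 0, 1, 0]

# random_state is fixed (an assumption) so the same tree is built on every run.
model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

# export_text returns the learned if/else rules as indented text.
rules = export_text(model, feature_names=["attendance", "assignment_marks"])
print(rules)

print("Prediction for [75, 70]:", model.predict([[75, 70]])[0])
```

Each indented line is a threshold test on one feature, and each leaf ends with the predicted class.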

6.15 K-Nearest Neighbors

KNN predicts based on the most similar nearby records.

from sklearn.neighbors import KNeighborsClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

prediction = model.predict([[78, 74]])

print("Prediction:", prediction[0])

Explanation: n_neighbors=3 means the model checks the 3 nearest records and uses majority voting.
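The neighbors behind a prediction can be listed with the model's kneighbors() method. A short sketch with the same data; the name neighbor_labels is illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[90, 85], [80, 78], [45, 40], [55, 50], [95, 90], [35, 30]]
y = [1, 1, 0, 0, 1, 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

new_student = [[78, 74]]

# kneighbors returns the distances to, and row indices of,
# the nearest training records (closest first).
distances, indices = model.kneighbors(new_student)

# Look up the label of each neighbor.
neighbor_labels = [y[i] for i in indices[0]]

print("Nearest rows:", indices[0])
print("Distances:", distances[0].round(2))
print("Neighbor labels:", neighbor_labels)
print("Prediction:", model.predict(new_student)[0])
```

The three nearest records are rows 1, 0 and 4. All three are labelled Pass, so the majority vote is Pass.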

6.16 Random Forest Classifier

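Random Forest trains many decision trees, each on a random sample of the training data, and combines their predictions. The combined result is usually more stable than the prediction of a single tree.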

from sklearn.ensemble import RandomForestClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

prediction = model.predict([[88, 80]])

print("Prediction:", prediction[0])
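Two attributes help interpret a trained forest: predict_proba() returns the class probabilities averaged over the trees, and feature_importances_ reports how much each feature contributed to the splits. A brief sketch with the same data; the feature order [attendance, assignment_marks] is assumed from the earlier examples.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[90, 85], [80, 78], [45, 40], [55, 50], [95, 90], [35, 30]]
y = [1, 1, 0, 0, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

new_student = [[88, 80]]

# Class probabilities averaged across the trees: [P(Fail), P(Pass)].
probabilities = model.predict_proba(new_student)[0]

# Relative contribution of each feature to the trees' splits (sums to 1).
importances = model.feature_importances_

print("Number of trees:", len(model.estimators_))
print("P(Fail), P(Pass):", probabilities)
print("Feature importances [attendance, assignment_marks]:", importances)
print("Prediction:", model.predict(new_student)[0])
```

On a dataset this small the importance values carry little meaning; the point is only to show where they can be read.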

6.17 Complete Mini Project: Student Performance Prediction

This project combines data preparation, train-test split, model training, prediction and evaluation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

data = {
    "Attendance": [90, 80, 45, 55, 95, 35, 85, 60],
    "Assignment": [85, 78, 40, 50, 90, 30, 82, 58],
    "Pass": [1, 1, 0, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

X = df[["Attendance", "Assignment"]]
y = df["Pass"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions)

print("Predictions:", predictions)
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(matrix)

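Project Explanation: The model learns from attendance and assignment marks to predict whether a student will pass. With 8 records and test_size=0.25, only 2 records are used for testing, so the accuracy can only be 0.0, 0.5 or 1.0. A larger dataset is needed for a reliable evaluation.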

6.18 Hands-On Activities

Activity 1: Regression

Create a Linear Regression model to predict salary based on years of experience.

Activity 2: Classification

Create a Logistic Regression model to predict Pass or Fail using attendance and marks.

Activity 3: Evaluation

Calculate accuracy, precision, recall and F1 score for a classification model.

Activity 4: Algorithm Comparison

Train Logistic Regression, Decision Tree and Random Forest on the same dataset and compare accuracy.

Mini Project

Build a student performance prediction system using a dataset with attendance, assignment marks, quiz marks and pass/fail result.

6.20 Chapter Summary

In this chapter, learners studied supervised learning, regression, classification, train-test split, evaluation metrics and popular supervised algorithms. Learners also built working Python models using Scikit-learn.

Remember: Regression predicts numbers, classification predicts categories, and proper evaluation is essential for building reliable predictive models.