Chapter 6: Supervised Learning Fundamentals

Understand regression and classification algorithms. Build predictive Machine Learning models using Python, evaluate performance, and visualize results.

Topics: Regression, Classification, Predictive Models, Evaluation, Scikit-learn

Regression: Predict Numbers
Classification: Predict Classes
Model: Training
Evaluation: Metrics

6.1 Chapter Overview

Supervised Learning is a Machine Learning approach where a model learns from labelled data. Labelled data contains both input features and the correct output. The model studies the relationship between input and output, then predicts outcomes for new unseen data.

Learning Outcome: By the end of this chapter, learners should be able to explain regression and classification, build predictive models using Python, evaluate performance, and interpret model results.
1. Collect Labelled Data
2. Split Dataset
3. Train Model
4. Predict
5. Evaluate
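
The five steps above can be sketched end to end in a few lines. This is only a minimal illustration: the data is made up and noise-free (marks = 10 × hours + 30 is an assumption chosen so the result is easy to check), and the variable names are not from the chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# 1. Collect labelled data (hypothetical values: marks = 10 * hours + 30).
hours = np.arange(1, 11).reshape(-1, 1)   # features, shape (10, 1)
marks = 10 * hours.ravel() + 30           # labels, shape (10,)

# 2. Split the dataset into training and testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    hours, marks, test_size=0.2, random_state=0
)

# 3. Train the model on the training part only.
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Predict for the unseen testing part.
predictions = model.predict(X_test)

# 5. Evaluate by comparing predictions with the known answers.
mae = mean_absolute_error(y_test, predictions)
print("Test MAE:", round(mae, 6))
```

Because the made-up data lies exactly on a straight line, the test error should be essentially zero; real data will not behave this neatly.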

6.2 What is Supervised Learning?

Supervised Learning means the model is trained using examples where the correct answer is already known. The learning process is similar to a student learning from questions with answer keys.

Component | Meaning | Example
Features X | Input values used for prediction | Study hours, attendance, assignments
Target y | Correct output or label | Exam score or Pass/Fail
Training Data | Data used to teach the model | Past student records
Testing Data | Data used to evaluate model performance | New unseen records
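
As a rough illustration of the components above, the sketch below separates a few student records into features X and target y. The record values and column order are assumptions made up for this example.

```python
import numpy as np

# Hypothetical past student records:
# [study_hours, attendance_percent, assignments_submitted, exam_score]
records = np.array([
    [2, 60, 3, 45],
    [5, 85, 6, 72],
    [8, 95, 8, 91],
    [3, 70, 4, 55],
])

# Features X: every column except the last one.
X = records[:, :-1]

# Target y: the last column (the known correct output).
y = records[:, -1]

print("X shape:", X.shape)  # 4 records, 3 features each
print("y shape:", y.shape)  # one label per record
```

X is two-dimensional (rows are records, columns are features) and y is one-dimensional, which is the layout Scikit-learn models expect.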

6.3 Regression vs Classification

Type | Purpose | Output | Example
Regression | Predict continuous numeric values | Number | Predict exam marks, salary, sales
Classification | Predict categories or labels | Class | Pass/Fail, Spam/Not Spam, Fraud/Normal

Regression Question

How many marks will the student score?

Classification Question

Will the student pass or fail?
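
The two questions can share the same students and differ only in the target. The sketch below shows this with made-up marks; the pass mark of 50 is an assumption for illustration.

```python
import numpy as np

# Hypothetical exam marks for five students.
marks = np.array([35, 48, 62, 79, 90])

# Regression target: the marks themselves (continuous numbers).
y_regression = marks

# Classification target: a category derived from the marks.
# The pass mark of 50 is an assumption for this example.
PASS_MARK = 50
y_classification = np.where(marks >= PASS_MARK, "Pass", "Fail")

print("Regression target:", y_regression)
print("Classification target:", y_classification)
```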

6.4 Linear Regression

Linear Regression is used when the target value is numeric. It tries to find a straight-line relationship between input and output.

y = mx + b
Symbol | Meaning
y | Predicted output
x | Input feature
m | Slope of the line
b | Intercept

Python Working Example: Predict Marks from Study Hours

from sklearn.linear_model import LinearRegression
import numpy as np

# X stores input features.
# Each value is placed inside another list because Scikit-learn expects 2D input.
X = np.array([[1], [2], [3], [4], [5]])

# y stores target values.
# These are the actual marks for each study hour value.
y = np.array([40, 50, 60, 70, 80])

# Create a Linear Regression model object.
model = LinearRegression()

# Train the model using input X and output y.
model.fit(X, y)

# Predict marks for a student who studies 6 hours.
prediction = model.predict([[6]])

print("Predicted Marks:", prediction[0])
Expected Output (floating-point rounding may show a value such as 90.00000000000001):
Predicted Marks: 90.0

Line-by-Line Explanation

Code | Explanation
from sklearn.linear_model import LinearRegression | Imports the Linear Regression algorithm.
import numpy as np | Imports NumPy to create numerical arrays.
X = np.array([[1], [2], [3], [4], [5]]) | Creates the input feature dataset containing study hours.
y = np.array([40, 50, 60, 70, 80]) | Creates the target values containing marks.
model = LinearRegression() | Creates the model.
model.fit(X, y) | Trains the model to learn the relationship between hours and marks.
model.predict([[6]]) | Predicts marks for 6 study hours.
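
To connect the fitted model back to y = mx + b, the learned slope and intercept can be read from the model's coef_ and intercept_ attributes. A minimal sketch on the same data follows; the names m, b and manual are chosen here for illustration.

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([40, 50, 60, 70, 80])

model = LinearRegression()
model.fit(X, y)

# coef_ holds one slope per feature; there is only one feature here.
m = model.coef_[0]
b = model.intercept_
print("Slope m:", round(m, 2))
print("Intercept b:", round(b, 2))

# Apply y = mx + b by hand for 6 study hours.
manual = m * 6 + b
print("Manual prediction for 6 hours:", round(manual, 2))
```

For this data the slope comes out close to 10 and the intercept close to 30, so the manual calculation 10 × 6 + 30 agrees with model.predict([[6]]).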

Expected Regression Graph

[Bar chart of marks against study hours: 1h = 40, 2h = 50, 3h = 60, 4h = 70, 5h = 80, 6h (predicted) = 90]

Graph Meaning: As study hours increase, predicted marks also increase.

6.5 Regression Visualization with Matplotlib

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([40, 50, 60, 70, 80])

model = LinearRegression()
model.fit(X, y)

plt.scatter(X, y, label="Actual Data")
plt.plot(X, model.predict(X), label="Regression Line")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")
plt.legend()
plt.show()
Explanation: scatter() shows real data points. plot() draws the learned regression line.

6.6 Train-Test Split

Train-test split divides data into two parts. The model learns from training data and is evaluated using testing data.

Training Data = Model Learning | Testing Data = Model Checking
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([35, 45, 50, 60, 65, 75, 85, 95])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)
Parameter | Explanation
test_size=0.25 | 25% of the data is used for testing.
random_state=42 | Makes the split repeatable.
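
The split only becomes useful once the model is trained on one part and checked on the other. The following sketch continues the example under that assumption; it fits on the training part and prints predictions for the held-out records.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([35, 45, 50, 60, 65, 75, 85, 95])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# With 8 records and test_size=0.25, 6 go to training and 2 to testing.
print("Training records:", len(X_train))
print("Testing records:", len(X_test))

# Fit on the training part only, then check predictions on the testing part.
model = LinearRegression()
model.fit(X_train, y_train)

test_predictions = model.predict(X_test)
for hours, actual, predicted in zip(X_test.ravel(), y_test, test_predictions):
    print(f"{hours} hours: actual={actual}, predicted={predicted:.1f}")
```

Which two records land in the test set depends on random_state, so the printed rows are not listed here.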

6.7 Regression Evaluation Metrics

Regression models are evaluated by comparing actual values with predicted values.

MAE = (1/n) Σ |y - ŷ|
MSE = (1/n) Σ (y - ŷ)²
RMSE = √MSE
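R² Score = 1 - Σ (y - ŷ)² / Σ (y - ȳ)², where ȳ is the mean of the actual values. It measures the share of the variation in y that the model explains; 1 is a perfect fit.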
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

actual = [80, 70, 90, 60]
predicted = [78, 74, 88, 65]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(actual, predicted)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r2)
Expected Output:
MAE: 3.25
MSE: 12.25
RMSE: 3.5
R2 Score: 0.902
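
The same numbers can be reproduced directly from the formulas with NumPy, which may help show what each metric is doing. This is a sketch for checking the arithmetic, not a replacement for the Scikit-learn functions.

```python
import numpy as np

actual = np.array([80, 70, 90, 60])
predicted = np.array([78, 74, 88, 65])

# Per-record errors: 2, -4, 2, -5
errors = actual - predicted

mae = np.mean(np.abs(errors))   # (2 + 4 + 2 + 5) / 4
mse = np.mean(errors ** 2)      # (4 + 16 + 4 + 25) / 4
rmse = np.sqrt(mse)

# R2 = 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", round(r2, 3))
```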

6.8 Classification Fundamentals

Classification predicts categories. Instead of predicting marks, it predicts labels such as Pass or Fail.

Features | Target
Attendance, Assignment Marks | Pass / Fail
Email words | Spam / Not Spam
Transaction amount and behavior | Fraud / Normal

6.9 Logistic Regression for Classification

Logistic Regression is used for classification problems. It predicts probability and then assigns a class.

Output: a probability between 0 and 1
from sklearn.linear_model import LogisticRegression
import numpy as np

# Features: attendance percentage
X = np.array([[90], [85], [45], [50], [95], [40]])

# Labels: 1 = Pass, 0 = Fail
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression()

model.fit(X, y)

prediction = model.predict([[88]])

if prediction[0] == 1:
    print("Prediction: Pass")
else:
    print("Prediction: Fail")
Expected Output:
Prediction: Pass
Code | Explanation
X = np.array([[90], ...]) | Stores attendance values as model input.
y = np.array([1, 1, 0...]) | Stores pass/fail labels.
model.fit(X, y) | Trains the classifier.
model.predict([[88]]) | Predicts whether 88% attendance is Pass or Fail.
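
To see the probability behind the predicted class, the fitted model's predict_proba method can be used. The sketch below repeats the attendance example and applies a 0.5 threshold by hand; the variable names probabilities and label are chosen here for illustration.

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[90], [85], [45], [50], [95], [40]])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one row per input: [P(Fail), P(Pass)],
# in the same order as model.classes_ (here 0 then 1).
probabilities = model.predict_proba([[88]])[0]
print("P(Fail):", round(probabilities[0], 3))
print("P(Pass):", round(probabilities[1], 3))

# With two classes, choosing the more probable class is the same
# as applying a 0.5 threshold to P(Pass).
label = 1 if probabilities[1] >= 0.5 else 0
print("Class from threshold:", label)
print("Class from predict():", model.predict([[88]])[0])
```

The exact probabilities depend on the Scikit-learn version and solver settings, so they are not listed here.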

6.10 Classification with Two Features

Most real classification problems use more than one feature. This example uses attendance and assignment marks.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Features: [attendance, assignment_marks]
X = np.array([
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
])

# Labels: 1 = Pass, 0 = Fail
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

new_student = [[88, 82]]

prediction = model.predict(new_student)

print("Prediction:", prediction[0])
Expected Output:
Prediction: 1
Meaning: The student is predicted to Pass.

6.11 Confusion Matrix

A confusion matrix compares actual labels with predicted labels.

Actual / Predicted | Predicted Pass | Predicted Fail
Actual Pass | True Positive | False Negative
Actual Fail | False Positive | True Negative
from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)

print(matrix)
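Expected Output:
[[2 1]
 [0 3]]
Note: Scikit-learn orders rows and columns by label value, so this output lists Fail (0) before Pass (1). The first row is Actual Fail [TN, FP] and the second row is Actual Pass [FN, TP], the reverse of the table layout above.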
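
To pull the four counts out by name, the matrix can be flattened with ravel(). A minimal sketch, relying on the label order 0 (Fail) then 1 (Pass):

```python
from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

# For labels 0 and 1 the matrix is laid out as
# [[TN, FP],
#  [FN, TP]]
# so ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
```

Here one failing student was wrongly predicted to pass (a false positive), and no passing student was missed.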

6.12 Classification Evaluation Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(actual, predicted))
print("Precision:", precision_score(actual, predicted))
print("Recall:", recall_score(actual, predicted))
print("F1 Score:", f1_score(actual, predicted))
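
The formulas can also be applied by hand to the same lists, counting TP, TN, FP and FN with Pass (1) as the positive class. This sketch is only a check on the arithmetic.

```python
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]

# Count each outcome, treating 1 (Pass) as the positive class.
pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 5 / 6
precision = tp / (tp + fp)                   # 3 / 4
recall = tp / (tp + fn)                      # 3 / 3
f1 = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1 Score:", round(f1, 3))
```

Recall is perfect because no passing student was missed, while precision is lower because one failing student was predicted to pass.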

6.13 Popular Supervised Learning Algorithms

Algorithm | Type | Use
Linear Regression | Regression | Predict numeric values
Logistic Regression | Classification | Predict binary classes
Decision Tree | Both | Rule-based prediction
K-Nearest Neighbors | Classification / Regression | Predict based on nearby examples
Support Vector Machine | Classification | Find separating boundary
Random Forest | Both | Combines many decision trees

6.14 Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier()
model.fit(X, y)

prediction = model.predict([[75, 70]])

print("Prediction:", prediction[0])
Explanation: A decision tree learns rules such as: if attendance is high and marks are high, predict Pass.
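
The rules a tree has learned can be printed as text with Scikit-learn's export_text helper. In the sketch below, the feature names and the fixed random_state are assumptions added for readability and repeatability.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

# random_state is fixed only to make this sketch repeatable.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Print the learned if/else rules using readable feature names.
rules = export_text(model, feature_names=["attendance", "assignment_marks"])
print(rules)
print("Tree depth:", model.get_depth())
```

On this small dataset either feature separates the two classes on its own, so the tree needs only a single threshold rule rather than a combined "attendance and marks" condition.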

6.15 K-Nearest Neighbors

KNN predicts based on the most similar nearby records.

from sklearn.neighbors import KNeighborsClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

prediction = model.predict([[78, 74]])

print("Prediction:", prediction[0])
Explanation: n_neighbors=3 means the model checks the 3 nearest records and uses majority voting.
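
To see which records take part in the vote, the fitted model's kneighbors method returns the distances and positions of the nearest training records. A minimal sketch on the same data:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

new_student = [[78, 74]]

# kneighbors returns distances and row positions of the 3 closest records,
# sorted from nearest to farthest.
distances, indices = model.kneighbors(new_student)

for distance, index in zip(distances[0], indices[0]):
    print(f"Neighbor {X[index]} label={y[index]} distance={distance:.2f}")

print("Prediction:", model.predict(new_student)[0])
```

The three nearest records are [80, 78], [90, 85] and [95, 90], all labelled 1, so the majority vote predicts Pass.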

6.16 Random Forest Classifier

Random Forest combines many decision trees to produce a stronger model.

from sklearn.ensemble import RandomForestClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

prediction = model.predict([[88, 80]])

print("Prediction:", prediction[0])
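
The individual trees are available on the fitted model as estimators_, and predict_proba shows how strongly the forest as a whole leans towards each class. The sketch below repeats the example to inspect both; the exact probabilities are not listed because they depend on the random sampling inside the forest.

```python
from sklearn.ensemble import RandomForestClassifier

X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]

y = [1, 1, 0, 0, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# n_estimators=100 means the forest holds 100 separate decision trees.
print("Number of trees:", len(model.estimators_))

new_student = [[88, 80]]

# Scikit-learn averages the class probabilities predicted by the trees.
proba = model.predict_proba(new_student)[0]
print("P(Fail):", round(proba[0], 2))
print("P(Pass):", round(proba[1], 2))

prediction = model.predict(new_student)
print("Prediction:", prediction[0])
```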

6.17 Complete Mini Project: Student Performance Prediction

This project combines data preparation, train-test split, model training, prediction and evaluation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

data = {
    "Attendance": [90, 80, 45, 55, 95, 35, 85, 60],
    "Assignment": [85, 78, 40, 50, 90, 30, 82, 58],
    "Pass": [1, 1, 0, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

X = df[["Attendance", "Assignment"]]
y = df["Pass"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions)

print("Predictions:", predictions)
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(matrix)
Project Explanation: The model learns from attendance and assignment marks to predict whether a student will pass.

6.18 Hands-On Activities

Activity 1: Regression

Create a Linear Regression model to predict salary based on years of experience.

Activity 2: Classification

Create a Logistic Regression model to predict Pass or Fail using attendance and marks.

Activity 3: Evaluation

Calculate accuracy, precision, recall and F1 score for a classification model.

Activity 4: Algorithm Comparison

Train Logistic Regression, Decision Tree and Random Forest on the same dataset and compare accuracy.

Mini Project

Build a student performance prediction system using a dataset with attendance, assignment marks, quiz marks and pass/fail result.

6.19 Interactive Final Assessment Quiz

Each correct answer gives +1 mark. Each wrong answer gives -0.5 mark.

1. Supervised Learning uses labelled data.

2. Regression predicts:

3. Classification predicts categories or labels.

4. Linear Regression is commonly used for:

5. Logistic Regression is commonly used for classification.

6. Train-test split helps evaluate the model on unseen data.

7. Which metric is used for regression error?

8. A confusion matrix is used for classification evaluation.

9. Random Forest combines multiple decision trees.

10. KNN predicts based on nearest examples.


6.20 Chapter Summary

In this chapter, learners studied supervised learning, regression, classification, train-test split, evaluation metrics and popular supervised algorithms. Learners also built working Python models using Scikit-learn.

Remember: Regression predicts numbers, classification predicts categories, and proper evaluation is essential for building reliable predictive models.