Chapter 6: Supervised Learning Fundamentals
Understand regression and classification algorithms. Build predictive Machine Learning models using Python, evaluate performance, and visualize results.
Predict Numbers
Predict Classes
Training
Metrics
6.1 Chapter Overview
Supervised Learning is a Machine Learning approach where a model learns from labelled data. Labelled data contains both input features and the correct output. The model studies the relationship between input and output, then predicts outcomes for new unseen data.
6.2 What is Supervised Learning?
Supervised Learning means the model is trained using examples where the correct answer is already known. The learning process is similar to a student learning from questions with answer keys.
| Component | Meaning | Example |
|---|---|---|
| Features X | Input values used for prediction | Study hours, attendance, assignments |
| Target y | Correct output or label | Exam score or Pass/Fail |
| Training Data | Data used to teach the model | Past student records |
| Testing Data | Data used to evaluate model performance | New unseen records |
6.3 Regression vs Classification
| Type | Purpose | Output | Example |
|---|---|---|---|
| Regression | Predict continuous numeric values | Number | Predict exam marks, salary, sales |
| Classification | Predict categories or labels | Class | Pass/Fail, Spam/Not Spam, Fraud/Normal |
Regression Question
How many marks will the student score?
Classification Question
Will the student pass or fail?
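The difference between the two questions lies in the target column, not in the features. The sketch below uses made-up records and an assumed pass mark of 50 (both are illustrative, not taken from a real dataset) to show how the same marks can serve as a regression target or be turned into a classification target.

```python
# Hypothetical records: study hours (feature) and exam marks.
study_hours = [1, 2, 3, 4, 5]
exam_marks = [35, 48, 52, 67, 80]

# Assumption for illustration only: 50 marks is the pass mark.
PASS_MARK = 50

# Regression target: the marks themselves (continuous numbers).
y_regression = exam_marks

# Classification target: a label derived from the marks.
y_classification = ["Pass" if m >= PASS_MARK else "Fail" for m in exam_marks]

print(y_regression)
print(y_classification)
```

A regression model would be trained on `y_regression`, a classification model on `y_classification`; the feature `study_hours` is the same in both cases.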
6.4 Linear Regression
Linear Regression is used when the target value is numeric. It fits a straight line that best describes the relationship between the input and the output. The line has the equation y = mx + b, where:
| Symbol | Meaning |
|---|---|
| y | Predicted output |
| x | Input feature |
| m | Slope of the line |
| b | Intercept |
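To make the symbols concrete, here is a minimal sketch that computes the slope m and intercept b by ordinary least squares using plain Python, assuming a line of the form y = m*x + b. It uses the same study-hours data as the worked example in this section; the variable names are illustrative only.

```python
# Study hours (x) and marks (y), matching the worked example in this section.
x = [1, 2, 3, 4, 5]
y = [40, 50, 60, 70, 80]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope m: how much y changes, on average, per unit change in x.
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
# Intercept b: the value of y where the line crosses x = 0.
b = mean_y - m * mean_x

print("m =", m, "b =", b)
print("Prediction for 6 hours:", m * 6 + b)
```

For this data the line is y = 10x + 30, so 6 study hours map to 90 marks.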
Python Working Example: Predict Marks from Study Hours
from sklearn.linear_model import LinearRegression
import numpy as np
# X stores input features.
# Each value is placed inside another list because Scikit-learn expects 2D input.
X = np.array([[1], [2], [3], [4], [5]])
# y stores target values.
# These are the actual marks for each study hour value.
y = np.array([40, 50, 60, 70, 80])
# Create a Linear Regression model object.
model = LinearRegression()
# Train the model using input X and output y.
model.fit(X, y)
# Predict marks for a student who studies 6 hours.
prediction = model.predict([[6]])
print("Predicted Marks:", prediction[0])
Output:
Predicted Marks: 90.0
Line-by-Line Explanation
| Code | Explanation |
|---|---|
| from sklearn.linear_model import LinearRegression | Imports the Linear Regression algorithm. |
| import numpy as np | Imports NumPy to create numerical arrays. |
| X = np.array([[1], [2], [3], [4], [5]]) | Creates the input feature dataset containing study hours. |
| y = np.array([40, 50, 60, 70, 80]) | Creates the target values containing marks. |
| model = LinearRegression() | Creates the model. |
| model.fit(X, y) | Trains the model to learn the relationship between hours and marks. |
| model.predict([[6]]) | Predicts marks for 6 study hours. |
Expected Regression Graph
Graph Meaning: As study hours increase, predicted marks also increase.
6.5 Regression Visualization with Matplotlib
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([40, 50, 60, 70, 80])
model = LinearRegression()
model.fit(X, y)
plt.scatter(X, y, label="Actual Data")
plt.plot(X, model.predict(X), label="Regression Line")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")
plt.legend()
plt.show()
6.6 Train-Test Split
Train-test split divides data into two parts. The model learns from training data and is evaluated using testing data.
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([35, 45, 50, 60, 65, 75, 85, 95])
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)
print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)
| Parameter | Explanation |
|---|---|
| test_size=0.25 | 25% of the data is held out for testing (here, 2 of the 8 records). |
| random_state=42 | Fixes the random seed so the split is the same on every run. |
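The idea behind a seeded split can be sketched in plain Python. This is a simplified illustration, not Scikit-learn's implementation: the function name `simple_train_test_split` is hypothetical, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(records, test_size=0.25, seed=42):
    """Shuffle a copy of the records and cut off the last part as test data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
    n_test = round(len(shuffled) * test_size)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# (study_hours, marks) pairs from the example above.
records = list(zip([1, 2, 3, 4, 5, 6, 7, 8], [35, 45, 50, 60, 65, 75, 85, 95]))
train, test = simple_train_test_split(records)
print(len(train), "training records,", len(test), "testing records")
```

Calling the function twice with the same seed returns the same split, which is what `random_state` provides in Scikit-learn.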
6.7 Regression Evaluation Metrics
Regression models are evaluated by comparing actual values with predicted values.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
actual = [80, 70, 90, 60]
predicted = [78, 74, 88, 65]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(actual, predicted)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r2)
Output:
MAE: 3.25
MSE: 12.25
RMSE: 3.5
R2 Score: 0.902
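To see where these numbers come from, the same four metrics can be worked out by hand. The following sketch uses only the standard library and the same `actual` and `predicted` lists; it is meant as an illustration of the formulas rather than a replacement for the Scikit-learn functions.

```python
import math

actual = [80, 70, 90, 60]
predicted = [78, 74, 88, 65]
n = len(actual)

# Error for each record: actual minus predicted.
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error

# R2 compares the model's squared error with that of always predicting the mean.
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, round(r2, 3))
```

MAE and RMSE are in the same unit as the target (marks), while R2 is unitless: a value close to 1 means the predictions explain most of the variation in the actual values.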
6.8 Classification Fundamentals
Classification predicts categories. Instead of predicting marks, it predicts labels such as Pass or Fail.
| Features | Target |
|---|---|
| Attendance, Assignment Marks | Pass / Fail |
| Email words | Spam / Not Spam |
| Transaction amount and behavior | Fraud / Normal |
6.9 Logistic Regression for Classification
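Logistic Regression is used for classification problems. It predicts the probability that an input belongs to a class and then assigns the class, by default choosing the class whose probability is above 0.5.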
from sklearn.linear_model import LogisticRegression
import numpy as np
# Features: attendance percentage
X = np.array([[90], [85], [45], [50], [95], [40]])
# Labels: 1 = Pass, 0 = Fail
y = np.array([1, 1, 0, 0, 1, 0])
model = LogisticRegression()
model.fit(X, y)
prediction = model.predict([[88]])
if prediction[0] == 1:
    print("Prediction: Pass")
else:
    print("Prediction: Fail")
Output:
Prediction: Pass
| Code | Explanation |
|---|---|
| X = np.array([[90], ...]) | Stores attendance values as model input. |
| y = np.array([1, 1, 0...]) | Stores pass/fail labels. |
| model.fit(X, y) | Trains the classifier. |
| model.predict([[88]]) | Predicts whether 88% attendance is Pass or Fail. |
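The two steps, predicting a probability and then assigning a class, can be sketched with the sigmoid function. The `weight` and `bias` values below are hypothetical and chosen only for illustration; a fitted model learns its own values from the training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical parameters for illustration; a fitted model learns its own values.
weight = 0.2
bias = -13.0

def predict_pass(attendance, threshold=0.5):
    """Return (probability of Pass, assigned label) for one attendance value."""
    probability = sigmoid(weight * attendance + bias)
    label = 1 if probability >= threshold else 0
    return probability, label

for attendance in (40, 60, 88):
    probability, label = predict_pass(attendance)
    print(attendance, round(probability, 3), label)
```

With these assumed parameters the probability crosses 0.5 near 65% attendance, so 40 and 60 are labelled Fail (0) and 88 is labelled Pass (1).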
6.10 Classification with Two Features
Most real classification problems use more than one feature. This example uses attendance and assignment marks.
from sklearn.linear_model import LogisticRegression
import numpy as np
# Features: [attendance, assignment_marks]
X = np.array([
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
])
# Labels: 1 = Pass, 0 = Fail
y = np.array([1, 1, 0, 0, 1, 0])
model = LogisticRegression()
model.fit(X, y)
new_student = [[88, 82]]
prediction = model.predict(new_student)
print("Prediction:", prediction[0])
Output:
Prediction: 1
Meaning: The student is predicted to Pass.
6.11 Confusion Matrix
A confusion matrix compares actual labels with predicted labels.
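| | Predicted Pass | Predicted Fail |
|---|---|---|
| Actual Pass | True Positive | False Negative |
| Actual Fail | False Positive | True Negative |
from sklearn.metrics import confusion_matrix
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix)
Output:
[[2 1]
 [0 3]]
Scikit-learn sorts the labels in ascending order, so the first row and column correspond to Fail (0) and the second to Pass (1). The printed matrix therefore reads [[TN, FP], [FN, TP]]: 2 true negatives, 1 false positive, 0 false negatives and 3 true positives.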
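The four cells can also be tallied by hand, which makes it clear what each one counts. This is a minimal sketch using the same `actual` and `predicted` lists and treating Pass (1) as the positive class.

```python
actual = [1, 1, 0, 0, 1, 0]      # 1 = Pass, 0 = Fail
predicted = [1, 1, 0, 1, 1, 0]

tp = fp = fn = tn = 0
for a, p in zip(actual, predicted):
    if a == 1 and p == 1:
        tp += 1   # actually Pass, predicted Pass
    elif a == 0 and p == 1:
        fp += 1   # actually Fail, predicted Pass
    elif a == 1 and p == 0:
        fn += 1   # actually Pass, predicted Fail
    else:
        tn += 1   # actually Fail, predicted Fail

print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```

The only mistake in this example is one failing student predicted as Pass, which appears as the single false positive.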
6.12 Classification Evaluation Metrics
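Accuracy, precision, recall and F1 score are all derived from the four confusion-matrix counts. As a hand-worked sketch, the block below starts from the counts for the example in section 6.11 (3 true positives, 1 false positive, 0 false negatives, 2 true negatives) and applies the standard formulas.

```python
# Counts for actual = [1, 1, 0, 0, 1, 0] and predicted = [1, 1, 0, 1, 1, 0],
# treating Pass (1) as the positive class.
tp, fp, fn, tn = 3, 1, 0, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all predictions that are correct
precision = tp / (tp + fp)                          # of predicted Pass, how many really passed
recall = tp / (tp + fn)                             # of actual Pass, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(round(accuracy, 3), precision, recall, round(f1, 3))
```

Here recall is perfect because no passing student was missed, while precision is lower because one failing student was predicted as Pass.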
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0]
print("Accuracy:", accuracy_score(actual, predicted))
print("Precision:", precision_score(actual, predicted))
print("Recall:", recall_score(actual, predicted))
print("F1 Score:", f1_score(actual, predicted))
6.13 Popular Supervised Learning Algorithms
| Algorithm | Type | Use |
|---|---|---|
| Linear Regression | Regression | Predict numeric values |
| Logistic Regression | Classification | Predict binary classes |
| Decision Tree | Classification / Regression | Rule-based prediction |
| K-Nearest Neighbors | Classification / Regression | Predict based on nearby examples |
| Support Vector Machine | Classification (a regression variant also exists) | Find separating boundary |
| Random Forest | Classification / Regression | Combines many decision trees |
6.14 Decision Tree Classifier
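A decision tree predicts by applying a chain of if/else rules learned from the data. The hand-written sketch below shows what a one-rule tree could look like for the attendance and assignment data used in this section. The threshold of 67.5 is a hypothetical value chosen for illustration, and the rules a fitted model learns may differ.

```python
def tree_predict(attendance, assignment_marks):
    """Hand-written one-rule 'tree': returns 1 for Pass, 0 for Fail."""
    # Hypothetical split point, halfway between 55 and 80 attendance.
    # assignment_marks is accepted but unused: one rule is enough for this data.
    if attendance <= 67.5:
        return 0
    return 1

students = [[90, 85], [80, 78], [45, 40], [55, 50], [95, 90], [35, 30]]
labels = [1, 1, 0, 0, 1, 0]

# The single rule reproduces every training label.
print([tree_predict(a, m) for a, m in students] == labels)
print(tree_predict(75, 70))
```

Real trees usually stack several such rules, one per internal node, and choose each threshold automatically during training.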
from sklearn.tree import DecisionTreeClassifier
X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]
y = [1, 1, 0, 0, 1, 0]
model = DecisionTreeClassifier()
model.fit(X, y)
prediction = model.predict([[75, 70]])
print("Prediction:", prediction[0])
6.15 K-Nearest Neighbors
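K-Nearest Neighbors (KNN) predicts the label of a new record from the labels of the k most similar (nearest) training records, using a majority vote for classification.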
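The idea can be sketched from scratch in a few lines. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the function name `knn_predict` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(X, y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(X, y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Features: [attendance, assignment_marks]; labels: 1 = Pass, 0 = Fail.
X = [[90, 85], [80, 78], [45, 40], [55, 50], [95, 90], [35, 30]]
y = [1, 1, 0, 0, 1, 0]
print(knn_predict(X, y, [78, 74]))
```

The three nearest records to [78, 74] are [80, 78], [90, 85] and [95, 90], all labelled Pass, so the vote returns 1.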
from sklearn.neighbors import KNeighborsClassifier
X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]
y = [1, 1, 0, 0, 1, 0]
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
prediction = model.predict([[78, 74]])
print("Prediction:", prediction[0])
6.16 Random Forest Classifier
Random Forest trains many decision trees on random samples of the data and combines their votes, which usually gives a more stable model than a single tree.
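The combining step can be illustrated with a simple majority vote. In the sketch below the three "trees" are hypothetical hand-written rules with made-up thresholds, used only to show the voting; a real forest learns each tree from the training data.

```python
from collections import Counter

# Three hypothetical, hand-written "trees". Each returns 1 (Pass) or 0 (Fail).
def tree_a(attendance, assignment):
    return 1 if attendance > 67.5 else 0

def tree_b(attendance, assignment):
    return 1 if assignment > 64 else 0

def tree_c(attendance, assignment):
    return 1 if attendance + assignment > 140 else 0

def forest_predict(attendance, assignment):
    """Collect one vote per tree and return the most common label."""
    votes = [tree(attendance, assignment) for tree in (tree_a, tree_b, tree_c)]
    return Counter(votes).most_common(1)[0][0]

print(forest_predict(88, 80))
print(forest_predict(66, 70))
```

For [66, 70] the trees disagree (two vote Fail, one votes Pass), and the majority decides the final label. This is why a forest is less sensitive to the quirks of any single tree.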
from sklearn.ensemble import RandomForestClassifier
X = [
    [90, 85],
    [80, 78],
    [45, 40],
    [55, 50],
    [95, 90],
    [35, 30]
]
y = [1, 1, 0, 0, 1, 0]
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
prediction = model.predict([[88, 80]])
print("Prediction:", prediction[0])
6.17 Complete Mini Project: Student Performance Prediction
This project combines data preparation, train-test split, model training, prediction and evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
data = {
    "Attendance": [90, 80, 45, 55, 95, 35, 85, 60],
    "Assignment": [85, 78, 40, 50, 90, 30, 82, 58],
    "Pass": [1, 1, 0, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)
X = df[["Attendance", "Assignment"]]
y = df["Pass"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions)
print("Predictions:", predictions)
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(matrix)
6.18 Hands-On Activities
Activity 1: Regression
Create a Linear Regression model to predict salary based on years of experience.
Activity 2: Classification
Create a Logistic Regression model to predict Pass or Fail using attendance and marks.
Activity 3: Evaluation
Calculate accuracy, precision, recall and F1 score for a classification model.
Activity 4: Algorithm Comparison
Train Logistic Regression, Decision Tree and Random Forest on the same dataset and compare accuracy.
Mini Project
Build a student performance prediction system using a dataset with attendance, assignment marks, quiz marks and pass/fail result.
6.19 Interactive Final Assessment Quiz
Each correct answer gives +1 mark. Each wrong answer gives -0.5 mark.
1. Supervised Learning uses labelled data.
2. Regression predicts:
3. Classification predicts categories or labels.
4. Linear Regression is commonly used for:
5. Logistic Regression is commonly used for classification.
6. Train-test split helps evaluate the model on unseen data.
7. Which metric is used for regression error?
8. A confusion matrix is used for classification evaluation.
9. Random Forest combines multiple decision trees.
10. KNN predicts based on nearest examples.
6.20 Chapter Summary
In this chapter, learners studied supervised learning, regression, classification, train-test split, evaluation metrics and popular supervised algorithms. Learners also built working Python models using Scikit-learn.