Chapter 5 - Statistics & Mathematics for Machine Learning

5.1 Chapter Overview

Machine Learning is built on mathematics. Probability helps measure uncertainty, statistics helps summarize and interpret data, linear algebra helps represent data using vectors and matrices, and calculus helps optimize models during training.

This chapter explains the mathematical foundation of ML in a beginner-friendly way. The aim is not merely to memorize formulas, but to understand why each concept matters in real ML workflows.

Learning Outcome: Learners should be able to understand uncertainty, interpret data patterns, compare variables, measure model error, understand data representation and explain optimization logic.

5.2 Learning Objectives

Explain probability concepts used in Machine Learning.
Understand descriptive and inferential statistics.
Apply formulas such as mean, variance, probability and MSE.
Understand correlation, covariance, confidence intervals and p-values.
Recognize common statistical tests used in data analysis.
Understand vectors, matrices, distance measures and similarity measures.
Explain gradients, derivatives and gradient descent in model training.

5.3 Probability for Machine Learning

Probability measures how likely an event is to happen. In ML, probability is used when models make predictions under uncertainty, such as a student having an 85% chance of passing.

P(A) = Favourable Outcomes / Total Outcomes (when all outcomes are equally likely)

Concept	Meaning	Example
Sample Space	All possible outcomes.	Head, Tail
Event	A set of one or more outcomes.	Getting Head
Independent Events	One event does not affect another.	Two coin tosses
Dependent Events	One event affects another.	Drawing cards without replacement
Mutually Exclusive Events	Events cannot happen together.	Pass and Fail

Probability Rules

Rule	Formula	Use
Complement	P(not A)=1-P(A)	Event does not happen
Addition	P(A or B)=P(A)+P(B)-P(A and B)	Either event happens
Multiplication	P(A and B)=P(A)×P(B)	Both independent events happen
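
The three rules above can be checked on a small example. The sketch below assumes one roll of a fair six-sided die and two separate fair coin tosses; the events A and B and the helper prob are illustrative choices, not part of the chapter.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption)
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3


def prob(event):
    """P(event) = favourable outcomes / total outcomes (equally likely outcomes)."""
    return Fraction(len(event), len(sample_space))


# Complement rule: P(not A) = 1 - P(A)
p_not_a = 1 - prob(A)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Multiplication rule for independent events: two separate fair coin tosses
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

print("P(not A):", p_not_a)
print("P(A or B):", p_a_or_b)
print("P(two heads):", p_two_heads)
```

Subtracting P(A and B) in the addition rule prevents outcomes 4 and 6, which belong to both events, from being counted twice.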

5.4 Joint, Conditional and Marginal Probability

Type	Meaning	Example
Joint Probability	Two events happen together.	High attendance and pass
Conditional Probability	Probability of A given B happened.	Pass given high attendance
Marginal Probability	Probability of one event regardless of others.	Overall pass probability

P(A | B) = P(A ∩ B) / P(B)

If 40 students have high attendance and 35 of them pass, then P(Pass | High Attendance) = 35/40 = 0.875, or 87.5%.
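
A small table of counts makes the three probability types easier to compare. In the sketch below, the high-attendance row (35 passes out of 40) comes from the example above, while the low-attendance row is an invented assumption added only so that a full table exists.

```python
# Counts of students by attendance level and result.
# High-attendance counts follow the chapter example; the low-attendance
# counts are made up for illustration.
counts = {
    ("high", "pass"): 35,
    ("high", "fail"): 5,
    ("low", "pass"): 10,
    ("low", "fail"): 10,
}
total = sum(counts.values())

# Joint probability: P(High Attendance and Pass)
p_high_and_pass = counts[("high", "pass")] / total

# Marginal probabilities: one variable, regardless of the other
p_high = sum(n for (att, _), n in counts.items() if att == "high") / total
p_pass = sum(n for (_, res), n in counts.items() if res == "pass") / total

# Conditional probability: P(Pass | High) = P(High and Pass) / P(High)
p_pass_given_high = p_high_and_pass / p_high

print("Joint P(High and Pass):", round(p_high_and_pass, 3))
print("Marginal P(Pass):", round(p_pass, 3))
print("Conditional P(Pass | High):", round(p_pass_given_high, 3))
```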

5.5 Bayes' Theorem

Bayes' theorem updates probability when new evidence is available. It is used in spam filtering, medical diagnosis, risk prediction and Naive Bayes classifiers.

P(A | B) = [P(B | A) × P(A)] / P(B)

Term	Meaning
P(A)	Prior probability
P(B | A)	Likelihood
P(B)	Total probability of evidence
P(A | B)	Updated probability
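
As a rough illustration of the terms above, the sketch below applies Bayes' theorem to a spam filter. All three input probabilities are hypothetical numbers chosen for the example, and bayes_posterior is an illustrative helper name.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical inputs (assumptions, not measured values)
p_spam = 0.20                 # prior: P(spam)
p_word_given_spam = 0.60      # likelihood: P(word appears | spam)
p_word_given_not_spam = 0.05  # P(word appears | not spam)

# Total probability of the evidence: P(word appears)
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * (1 - p_spam)

# Updated probability: P(spam | word appears)
p_spam_given_word = bayes_posterior(p_spam, p_word_given_spam, p_word)

print("P(word):", round(p_word, 3))
print("P(spam | word):", round(p_spam_given_word, 3))
```

Under these assumed numbers, seeing the word raises the probability of spam from the 20% prior to a much higher updated value, which is the "update on new evidence" that the theorem describes.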

5.6 Probability Distributions

A probability distribution describes how values are spread across possible outcomes. ML uses distributions to understand data behavior and uncertainty.

Distribution	Use	Example
Normal	Bell-shaped continuous data	Exam scores
Binomial	Success/failure outcomes	Pass or fail
Poisson	Counts of events	Website visits per minute
t-Distribution	Small sample inference	Small class performance

Sampling distributions show how a statistic such as a sample mean varies across repeated samples.
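
A sampling distribution can be approximated by simulation. The sketch below assumes a hypothetical population of exam scores spread uniformly between 40 and 100, draws many samples of 30 scores, and records each sample mean; the population, sample size and seed are all illustrative choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 exam scores, uniform between 40 and 100
population = [rng.uniform(40, 100) for _ in range(10_000)]

# Draw 1,000 samples of 30 scores each and keep every sample mean
sample_means = [
    statistics.mean(rng.sample(population, 30))
    for _ in range(1_000)
]

print("Population mean:", round(statistics.mean(population), 2))
print("Mean of sample means:", round(statistics.mean(sample_means), 2))
print("Population std dev:", round(statistics.pstdev(population), 2))
print("Std dev of sample means:", round(statistics.stdev(sample_means), 2))
```

The sample means should cluster around the population mean with a much smaller spread than the individual scores, even though the population itself is not bell-shaped.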

5.7 Statistics for Machine Learning

Statistics helps summarize, analyze and interpret data. It is essential for understanding data patterns, evaluating models and making data-driven decisions.

Descriptive Statistics

Measure	Meaning	ML Use
Mean	Average value	Average marks or sales
Median	Middle value	Useful with outliers
Mode	Most frequent value	Most common category
Variance	Average squared spread	Measures variability
Standard Deviation	Typical distance from mean	Shows spread

Variance = Σ(x - mean)² / n

Standard Deviation = √Variance

import numpy as np

marks = [60, 70, 80, 90, 100]
# Central tendency
print("Mean:", np.mean(marks))
print("Median:", np.median(marks))
# Spread around the mean
print("Variance:", np.var(marks))
print("Standard Deviation:", np.std(marks))
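
The variance formula above divides by n, which matches NumPy's default behaviour. When the marks are only a sample from a larger group, dividing by n - 1 is a common alternative. The sketch below shows both using the ddof parameter of np.var and np.std; treating these five marks as a sample is an assumption made for illustration.

```python
import numpy as np

marks = [60, 70, 80, 90, 100]

# Population variance: divide by n (NumPy's default, ddof=0)
population_var = np.var(marks)

# Sample variance: divide by n - 1 (ddof=1)
sample_var = np.var(marks, ddof=1)

print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("Sample standard deviation:", np.std(marks, ddof=1))
```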

5.8 Inferential Statistics

Concept	Meaning	Example
Confidence Interval	Range of plausible values for a population parameter at a stated confidence level	Average score between 70 and 78
Central Limit Theorem	Means of sufficiently large samples are approximately normally distributed	Repeated class samples
P-Value	Probability of a result at least as extreme as the one observed, assuming the null hypothesis is true	Significance testing
Hypothesis Testing	Tests assumptions using evidence	New teaching method improves results?

Simple Meaning: Inferential statistics helps decide whether a pattern is meaningful or just due to random chance.
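
A confidence interval for a mean can be sketched with the standard library alone. The example below assumes 40 hypothetical exam scores generated from a seeded random generator and uses the normal approximation (z close to 1.96 for 95%); for small samples a t-distribution critical value would usually be preferred.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical sample: 40 exam scores centred near 74 (made-up data)
scores = [rng.gauss(74, 8) for _ in range(40)]

n = len(scores)
sample_mean = statistics.mean(scores)

# Standard error: sample standard deviation divided by sqrt(n)
standard_error = statistics.stdev(scores) / math.sqrt(n)

# z value leaving 2.5% in each tail of the standard normal (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```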

5.9 Correlation, Covariance, Skewness and Kurtosis

Concept	Meaning	ML Use
Covariance	Whether two variables move together	Attendance and marks
Correlation	Standardized relationship from -1 to +1	Feature relationship analysis
Skewness	Asymmetry of distribution	Detects lopsided data
Kurtosis	Tail heaviness	Detects outlier-prone data

Correlation near +1 = strong positive relationship

Correlation near -1 = strong negative relationship

Correlation near 0 = weak or no linear relationship
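
Covariance and correlation can be computed step by step to see how they relate. The sketch below uses the attendance and marks values from this chapter's correlation example and divides by n - 1 (the sample convention), which is a choice made here rather than something the chapter fixes.

```python
import math

attendance = [60, 70, 80, 90, 95]
marks = [55, 65, 75, 88, 92]

n = len(attendance)
mean_x = sum(attendance) / n
mean_y = sum(marks) / n

# Sample covariance: summed product of deviations, divided by n - 1
covariance = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(attendance, marks)
) / (n - 1)

# Sample standard deviations of each variable
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in attendance) / (n - 1))
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in marks) / (n - 1))

# Correlation: covariance rescaled to lie between -1 and +1
correlation = covariance / (std_x * std_y)

print("Covariance:", covariance)
print("Correlation:", round(correlation, 4))
```

Covariance depends on the units of the two variables, while correlation removes the units by dividing by both standard deviations, which is why it is easier to compare across feature pairs.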

5.10 Hypothesis Testing and Parametric Tests

Hypothesis testing checks whether a claim about data is supported by evidence.

Test	Purpose	Example
Z-Test	Compare sample mean with population mean when sample is large and population variance is known	Large exam dataset
T-Test	Compare means when sample is small or population variance is unknown	Compare two small classes
F-Test	Compare variances	Variation between groups
Chi-Square Test	Relationship between categorical variables	Course type and pass/fail

If the p-value is below 0.05, the result is commonly considered statistically significant.
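
The p-value rule above can be illustrated with a one-sample Z-test. The sketch below assumes hypothetical numbers throughout: a population mean of 70 with a known standard deviation of 10, and a sample of 50 students averaging 73.

```python
import math
import statistics

# Hypothetical inputs (assumptions for illustration)
population_mean = 70
population_std = 10
sample_mean = 73
n = 50

# Z statistic: how many standard errors the sample mean is from the
# population mean claimed by the null hypothesis
z = (sample_mean - population_mean) / (population_std / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant at the 0.05 level")
else:
    print("Not statistically significant at the 0.05 level")
```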

5.11 Bias-Variance Tradeoff, MLE and MSE

Concept	Meaning	ML Importance
Bias	Error from overly simple assumptions	High bias can cause underfitting
Variance	Error from sensitivity to training data	High variance can cause overfitting
Bias-Variance Tradeoff	Balancing simplicity and flexibility	Improves generalization
Maximum Likelihood Estimation	Finds parameters that make observed data most likely	Used in statistical models
Mean Squared Error	Average squared prediction error	Regression model error

MSE = (1/n) Σ(yᵢ - ŷᵢ)²
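
Maximum Likelihood Estimation from the table above can be sketched for a simple pass/fail model. The example below assumes ten hypothetical results and searches a grid of candidate pass probabilities for the one that makes the data most likely; the grid search is a deliberately simple stand-in for the calculus-based solution.

```python
import math

# Hypothetical results for 10 students: 1 = pass, 0 = fail
results = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]


def log_likelihood(p, data):
    """Log-probability of the observed data if each pass has probability p."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)


# Candidate values 0.01, 0.02, ..., 0.99
candidates = [i / 100 for i in range(1, 100)]

# The MLE is the candidate with the highest log-likelihood
best_p = max(candidates, key=lambda p: log_likelihood(p, results))

print("MLE of pass probability:", best_p)
print("Sample proportion of passes:", sum(results) / len(results))
```

For this model the maximum likelihood estimate coincides with the sample proportion of passes, so the two printed values should agree.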

5.12 Linear Algebra for Machine Learning

Linear algebra represents and manipulates data using vectors and matrices. Many ML algorithms depend on vector and matrix operations.

Concept	Meaning	ML Use
Vector	A list of numbers	One data record
Matrix	A rectangular table of numbers	Dataset with rows and columns
Dot Product	Combines two vectors into a number	Neural networks and similarity
Eigenvalues / Eigenvectors	Important directions in transformations	PCA and dimensionality reduction
SVD	Matrix factorization	Compression and recommendation systems

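import numpy as np

# A vector: one data record with three features
student_vector = np.array([85, 76, 4])
# A matrix: three records (rows) by three features (columns)
dataset_matrix = np.array([[85, 76, 4], [70, 88, 3], [90, 92, 5]])
print(student_vector)
print(dataset_matrix)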
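
The dot product from the table above is what turns a record and a set of weights into a single number. The sketch below reuses the chapter's dataset matrix; the weights vector is a hypothetical set of values chosen only to show the operation.

```python
import numpy as np

dataset_matrix = np.array([[85, 76, 4], [70, 88, 3], [90, 92, 5]])

# Hypothetical weights, one per feature (column)
weights = np.array([0.5, 0.4, 2.0])

# Dot product: one record combined with the weights gives one number
first_score = np.dot(dataset_matrix[0], weights)

# Matrix-vector product: every row is scored in a single operation
all_scores = dataset_matrix @ weights

print("Shape (rows, columns):", dataset_matrix.shape)
print("First score:", first_score)
print("All scores:", all_scores)
```

A linear model or a single neural network neuron computes exactly this kind of weighted sum before any further processing.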

5.13 Distance and Similarity Measures

Measure	Meaning	Use
Euclidean Distance	Straight-line distance	KNN, clustering
Manhattan Distance	Sum of absolute differences	Grid-like distance
Cosine Similarity	Angle similarity	Text and recommendations
Jaccard Similarity	Intersection divided by union	Set and text similarity
Orthogonality	Vectors at right angles	Independent directions
Projection	Mapping one vector onto another	Dimensionality reduction

Euclidean Distance = √(Σ(xᵢ - yᵢ)²)

Manhattan Distance = Σ|xᵢ - yᵢ|

Cosine Similarity = (A · B) / (||A|| ||B||)
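
The remaining measures can be written as short functions. The sketch below uses the two student records from this chapter's Euclidean distance example; the function names and the two sets of course names are hypothetical, added only to demonstrate Jaccard similarity.

```python
import math


def manhattan_distance(x, y):
    """Sum of absolute differences between matching coordinates."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union."""
    return len(s & t) / len(s | t)


student_a = [85, 76]
student_b = [90, 80]
print("Manhattan distance:", manhattan_distance(student_a, student_b))
print("Cosine similarity:", round(cosine_similarity(student_a, student_b), 4))

# Hypothetical sets of courses taken by two students
courses_a = {"math", "physics", "python"}
courses_b = {"math", "python", "statistics"}
print("Jaccard similarity:", jaccard_similarity(courses_a, courses_b))
```

Cosine similarity compares direction rather than size, so two records with similar proportions score close to 1 even when their raw values differ.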

5.14 Calculus for Machine Learning

Calculus is used to optimize Machine Learning models. During training, models adjust parameters to reduce error. Derivatives and gradients show direction and rate of change.

Concept	Meaning	ML Use
Derivative	Rate of change of a function	How loss changes
Gradient	Vector of partial derivatives	Direction to update parameters
Higher-Order Derivatives	Derivatives of derivatives	Curvature analysis
Multivariable Calculus	Calculus with many inputs	Models with many features
Chain Rule	Derivative of composite functions	Backpropagation
Jacobian Matrix	Matrix of first-order partial derivatives	Vector-valued functions
Hessian Matrix	Matrix of second-order partial derivatives	Advanced optimization
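
A gradient can be approximated numerically, which helps build intuition before using the exact formulas. The sketch below assumes a hypothetical two-parameter loss function and estimates each partial derivative with central differences; the function names and the step size h are illustrative choices.

```python
def loss(w1, w2):
    """Hypothetical loss with its minimum at w1 = 3, w2 = -1."""
    return (w1 - 3) ** 2 + (w2 + 1) ** 2


def numerical_gradient(f, w1, w2, h=1e-5):
    """Approximate both partial derivatives with central differences."""
    d_w1 = (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)
    d_w2 = (f(w1, w2 + h) - f(w1, w2 - h)) / (2 * h)
    return d_w1, d_w2


# Gradient at the point (0, 0)
grad = numerical_gradient(loss, 0.0, 0.0)

# Exact gradient from calculus: (2 * (w1 - 3), 2 * (w2 + 1))
exact = (2 * (0.0 - 3), 2 * (0.0 + 1))

print("Numerical gradient:", grad)
print("Exact gradient:", exact)
```

The negative first component says that increasing w1 lowers the loss, and the positive second component says that increasing w2 raises it, so the gradient indicates which way each parameter should move.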

5.15 Gradient Descent and Stochastic Gradient Descent

Gradient descent minimizes a loss function by updating parameters step by step in the direction that reduces error.

New Parameter = Old Parameter - Learning Rate × Gradient

Method	Meaning	Use
Gradient Descent	Uses the full dataset	Stable but may be slow
Stochastic Gradient Descent	Uses one sample at a time	Faster but noisier
Mini-Batch Gradient Descent	Uses small batches	Common in deep learning

Simple Meaning: Gradient descent is like walking downhill step by step until reaching the lowest error point.
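
The update rule above can be traced on a tiny regression problem. The sketch below assumes hypothetical data that roughly follows y = 2x, a one-parameter model y = w * x, and full-batch gradient descent on the MSE loss; the learning rate and step count are illustrative values.

```python
# Hypothetical data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def mse(w):
    """Mean squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w):
    """Derivative of the MSE with respect to w."""
    return (2 / len(xs)) * sum((w * x - y) * x for x, y in zip(xs, ys))


w = 0.0               # starting parameter
learning_rate = 0.01  # step size

for step in range(200):
    # New Parameter = Old Parameter - Learning Rate * Gradient
    w = w - learning_rate * mse_gradient(w)

print("Learned w:", round(w, 3))
print("Final MSE:", round(mse(w), 4))
```

Stochastic gradient descent would follow the same loop but compute the gradient from one (x, y) pair per step instead of the whole dataset.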

5.16 Practical Python Examples

Mean Squared Error Example

actual = [80, 70, 90]
predicted = [78, 74, 88]

# Squared error for each (actual, predicted) pair
errors = []
for a, p in zip(actual, predicted):
    errors.append((a - p) ** 2)

# MSE is the mean of the squared errors
mse = sum(errors) / len(errors)
print("MSE:", mse)

Output:
MSE: 8.0

Euclidean Distance Example

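import math

# Two student records with two features each
student_a = [85, 76]
student_b = [90, 80]
# Square root of the summed squared differences, feature by feature
distance = math.sqrt((student_a[0] - student_b[0]) ** 2 + (student_a[1] - student_b[1]) ** 2)
print("Distance:", distance)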

Correlation Example

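import numpy as np

attendance = [60, 70, 80, 90, 95]
marks = [55, 65, 75, 88, 92]
# np.corrcoef returns a 2x2 matrix; the off-diagonal entries hold the
# correlation between attendance and marks
correlation = np.corrcoef(attendance, marks)
print(correlation)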

5.17 Hands-On Practice Activities

Activity 1: Descriptive Statistics

Create a Python list of 10 marks and calculate mean, median, variance and standard deviation.

Activity 2: Probability

Calculate the probability of passing if 35 out of 50 students passed.

Activity 3: Distance Measure

Calculate Euclidean distance between two student records using attendance and marks.

Activity 4: MSE

Calculate Mean Squared Error between actual and predicted marks.

Mini Project: ML Math Report

Create a short analysis report using student marks to calculate descriptive statistics, correlation, MSE and distance between two students.

5.18 Chapter Summary

In this chapter, learners studied probability, statistics, linear algebra and calculus for Machine Learning. These topics support uncertainty modelling, data interpretation, hypothesis testing, distance measurement, similarity comparison, error calculation and model optimization.

Remember: Machine Learning is not only coding. Strong mathematical understanding helps learners interpret models, improve accuracy and make better data-driven decisions.