Chapter 5: Statistics & Mathematics for Machine Learning
Learn probability, statistics, linear algebra and calculus concepts required to understand Machine Learning algorithms, model evaluation and optimization.
Uncertainty
Patterns
Data Shape
Optimization
5.1 Chapter Overview
Machine Learning is built on mathematics. Probability helps measure uncertainty, statistics helps summarize and interpret data, linear algebra helps represent data using vectors and matrices, and calculus helps optimize models during training.
This chapter explains the mathematical foundation of ML in a beginner-friendly way. The aim is not only to memorize formulas but to understand why each concept matters in real ML workflows.
5.2 Learning Objectives
- Explain probability concepts used in Machine Learning.
- Understand descriptive and inferential statistics.
- Apply formulas such as mean, variance, probability and MSE.
- Understand correlation, covariance, confidence intervals and p-values.
- Recognize common statistical tests used in data analysis.
- Understand vectors, matrices, distance measures and similarity measures.
- Explain gradients, derivatives and gradient descent in model training.
5.3 Probability for Machine Learning
Probability measures how likely an event is to happen. In ML, probability is used when models make predictions under uncertainty, such as estimating that a student has an 85% chance of passing.
| Concept | Meaning | Example |
|---|---|---|
| Sample Space | All possible outcomes. | Head, Tail |
| Event | A set of one or more outcomes. | Getting Head |
| Independent Events | One event does not affect another. | Two coin tosses |
| Dependent Events | One event affects another. | Drawing cards without replacement |
| Mutually Exclusive Events | Events cannot happen together. | Pass and Fail |
Probability Rules
| Rule | Formula | Use |
|---|---|---|
| Complement | P(not A)=1-P(A) | Event does not happen |
| Addition | P(A or B)=P(A)+P(B)-P(A and B) | Either event happens |
| Multiplication | P(A and B)=P(A)×P(B) | Both independent events happen |
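
The three rules can be checked by counting outcomes. The sketch below assumes a fair six-sided die and two fair coin tosses; the events `even` and `greater_than_3` and the helper `prob` are illustrative names, not part of the chapter's material.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
greater_than_3 = {4, 5, 6}

# Complement rule: P(not A) = 1 - P(A)
p_not_even = 1 - prob(even)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_even_or_gt3 = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)

# Multiplication rule for independent events: heads on two separate fair coin tosses
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

print("P(not even):", p_not_even)
print("P(even or > 3):", p_even_or_gt3)
print("P(two heads):", p_two_heads)
```

The addition rule subtracts P(A and B) because the outcomes 4 and 6 belong to both events and would otherwise be counted twice.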
5.4 Joint, Conditional and Marginal Probability
| Type | Meaning | Example |
|---|---|---|
| Joint Probability | Two events happen together. | High attendance and pass |
| Conditional Probability | Probability of A given B happened. | Pass given high attendance |
| Marginal Probability | Probability of one event regardless of others. | Overall pass probability |
If 40 students have high attendance and 35 of them pass, then P(Pass | High Attendance) = 35/40 = 0.875, or 87.5%.
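
The three kinds of probability can be computed from one table of counts. In this sketch only the 40 high-attendance students and the 35 who pass come from the text; the low-attendance counts are assumed values added so that a full class exists to divide by.

```python
# Counts of students by attendance and result.
# high_pass and high_fail follow the text (35 of 40 pass).
# low_pass and low_fail are hypothetical values for illustration.
high_pass, high_fail = 35, 5
low_pass, low_fail = 10, 10
total = high_pass + high_fail + low_pass + low_fail

p_joint = high_pass / total                  # P(High Attendance and Pass)
p_high = (high_pass + high_fail) / total     # marginal P(High Attendance)
p_pass = (high_pass + low_pass) / total      # marginal P(Pass)
p_pass_given_high = p_joint / p_high         # conditional P(Pass | High Attendance)

print("Joint:", round(p_joint, 3))
print("Marginal P(High Attendance):", round(p_high, 3))
print("Marginal P(Pass):", round(p_pass, 3))
print("Conditional P(Pass | High Attendance):", round(p_pass_given_high, 3))
```

The conditional probability depends only on the high-attendance row, so it stays at 0.875 whatever values are assumed for the low-attendance group.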
5.5 Bayes' Theorem
Bayes' theorem updates probability when new evidence is available. It is used in spam filtering, medical diagnosis, risk prediction and Naive Bayes classifiers.
| Term | Meaning |
|---|---|
| P(A) | Prior probability |
| P(B\|A) | Likelihood |
| P(B) | Total probability of evidence |
| P(A\|B) | Updated probability |
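
Using the terms in the table, Bayes' theorem reads P(A|B) = P(B|A) × P(A) / P(B). A minimal sketch for a spam filter follows; every number in it is a hypothetical value chosen for easy arithmetic, not a measured rate.

```python
# A = "email is spam", B = "email contains the word 'offer'".
# All three inputs are assumed values for illustration.
p_spam = 0.2                 # prior P(A)
p_word_given_spam = 0.6      # likelihood P(B|A)
p_word_given_ham = 0.05      # P(B|not A)

# Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(word):", round(p_word, 3))
print("P(spam | word):", round(p_spam_given_word, 3))
```

Seeing the word raises the probability of spam from the prior of 0.2 to about 0.75 under these assumed rates.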
5.6 Probability Distributions
A probability distribution describes how values are spread across possible outcomes. ML uses distributions to understand data behavior and uncertainty.
| Distribution | Use | Example |
|---|---|---|
| Normal | Bell-shaped continuous data | Exam scores |
| Binomial | Success/failure outcomes | Pass or fail |
| Poisson | Counts of events | Website visits per minute |
| t-Distribution | Small sample inference | Small class performance |
Sampling distributions show how a statistic such as a sample mean varies across repeated samples.
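
A sampling distribution can be approximated by simulation. The sketch below assumes a made-up population of exam scores and a sample size of 30; both are illustrative choices, and the random seed is fixed only so that repeated runs match.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 exam scores, roughly normal around 70
population = [random.gauss(70, 10) for _ in range(10_000)]

# Draw many samples of 30 scores and record the mean of each sample
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(1_000)
]

print("Population mean:", round(statistics.mean(population), 2))
print("Mean of sample means:", round(statistics.mean(sample_means), 2))
print("Spread of sample means:", round(statistics.stdev(sample_means), 2))
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of individual scores (roughly the population standard deviation divided by the square root of the sample size).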
5.7 Statistics for Machine Learning
Statistics helps summarize, analyze and interpret data. It is essential for understanding data patterns, evaluating models and making data-driven decisions.
Descriptive Statistics
| Measure | Meaning | ML Use |
|---|---|---|
| Mean | Average value | Average marks or sales |
| Median | Middle value | Useful with outliers |
| Mode | Most frequent value | Most common category |
| Variance | Average squared deviation from the mean | Measures variability |
| Standard Deviation | Typical distance from mean | Shows spread |
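import numpy as np
marks = [60, 70, 80, 90, 100]
print("Mean:", np.mean(marks))
print("Median:", np.median(marks))
# np.var and np.std default to the population formulas (divide by n);
# pass ddof=1 to get the sample versions (divide by n - 1).
print("Variance:", np.var(marks))
print("Standard Deviation:", np.std(marks))
5.8 Inferential Statistics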
| Concept | Meaning | Example |
|---|---|---|
| Confidence Interval | Range likely to contain true population value | Average score between 70 and 78 |
| Central Limit Theorem | Means of sufficiently large samples are approximately normally distributed | Repeated class samples |
| P-Value | Probability of a result at least as extreme as the one observed, assuming the null hypothesis is true | Significance testing |
| Hypothesis Testing | Tests assumptions using evidence | New teaching method improves results? |
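
A confidence interval for a mean can be sketched with the standard library. The scores below are hypothetical, and the interval uses the normal approximation for simplicity; with a sample this small, a t-distribution critical value would give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical sample of exam scores
scores = [72, 75, 68, 80, 74, 77, 70, 73, 76, 75]

n = len(scores)
mean = statistics.mean(scores)
std_err = statistics.stdev(scores) / math.sqrt(n)   # standard error of the mean

# Critical value for 95% confidence from the standard normal distribution (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)

lower = mean - z * std_err
upper = mean + z * std_err

print(f"Mean: {mean:.2f}")
print(f"95% CI (normal approximation): {lower:.2f} to {upper:.2f}")
```

The interval narrows as the sample grows, because the standard error shrinks with the square root of the sample size.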
5.9 Correlation, Covariance, Skewness and Kurtosis
| Concept | Meaning | ML Use |
|---|---|---|
| Covariance | Whether two variables move together | Attendance and marks |
| Correlation | Standardized relationship from -1 to +1 | Feature relationship analysis |
| Skewness | Asymmetry of distribution | Flags lopsided feature distributions |
| Kurtosis | Tail heaviness | Detects extreme behavior |
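
Covariance and correlation are closely related: correlation is covariance rescaled by the two standard deviations. The sketch below computes both by hand on illustrative attendance and marks values, using the sample formulas (dividing by n - 1).

```python
import math

# Illustrative values for five students
attendance = [60, 70, 80, 90, 95]
marks = [55, 65, 75, 88, 92]
n = len(attendance)

mean_x = sum(attendance) / n
mean_y = sum(marks) / n

# Sample covariance: sum of products of deviations, divided by n - 1
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(attendance, marks)) / (n - 1)

# Sample standard deviations of each variable
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in attendance) / (n - 1))
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in marks) / (n - 1))

# Correlation: covariance scaled to the range -1 to +1
corr = cov / (std_x * std_y)

print("Covariance:", cov)
print("Correlation:", round(corr, 4))
```

Covariance carries the units of both variables, so its size is hard to interpret; correlation removes the units, which is why it is preferred for comparing feature relationships.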
5.10 Hypothesis Testing and Parametric Tests
Hypothesis testing checks whether a claim about data is supported by evidence.
| Test | Purpose | Example |
|---|---|---|
| Z-Test | Compare sample mean with population mean when sample is large | Large exam dataset |
| T-Test | Compare means when sample is small | Compare two small classes |
| F-Test | Compare variances | Variation between groups |
| Chi-Square Test | Relationship between categorical variables | Course type and pass/fail |
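
As a rough illustration of a t-test, the sketch below computes Welch's t statistic for two small classes by hand. The marks are hypothetical, and the code stops at the test statistic; converting it to a p-value needs the t-distribution, which a statistics library such as SciPy (`scipy.stats.ttest_ind`) can supply.

```python
import math
import statistics

# Hypothetical marks from two small classes
class_a = [78, 82, 75, 90, 85]
class_b = [70, 72, 68, 80, 75]

mean_a = statistics.mean(class_a)
mean_b = statistics.mean(class_b)
var_a = statistics.variance(class_a)   # sample variance (divides by n - 1)
var_b = statistics.variance(class_b)

# Welch's t statistic: difference in means divided by its standard error
std_err = math.sqrt(var_a / len(class_a) + var_b / len(class_b))
t_stat = (mean_a - mean_b) / std_err

print("Difference in means:", mean_a - mean_b)
print("t statistic:", round(t_stat, 3))
```

A larger absolute t statistic means the difference in means is large relative to the variation within the classes.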
5.11 Bias-Variance Tradeoff, MLE and MSE
| Concept | Meaning | ML Importance |
|---|---|---|
| Bias | Error from overly simple assumptions | High bias can cause underfitting |
| Variance | Error from sensitivity to training data | High variance can cause overfitting |
| Bias-Variance Tradeoff | Balancing simplicity and flexibility | Improves generalization |
| Maximum Likelihood Estimation | Finds parameters that make observed data most likely | Used in statistical models |
| Mean Squared Error | Average squared prediction error | Regression model error |
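
Maximum likelihood estimation can be illustrated with a pass/fail example. The sketch assumes 35 of 50 students passed (hypothetical counts) and searches a grid of candidate pass probabilities for the one that makes those counts most likely.

```python
import math

passed, total = 35, 50   # hypothetical counts

def log_likelihood(p):
    """Log-likelihood of pass probability p for the observed pass/fail counts."""
    return passed * math.log(p) + (total - passed) * math.log(1 - p)

# Try candidate values 0.01, 0.02, ..., 0.99 and keep the most likely one
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=log_likelihood)

print("Maximum likelihood estimate of p:", best_p)
print("Observed pass rate:", passed / total)
```

For this simple model the estimate matches the observed pass rate. The grid search is used here only to make the idea visible; real models usually maximize the likelihood with calculus or numerical optimization.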
5.12 Linear Algebra for Machine Learning
Linear algebra represents and manipulates data using vectors and matrices. Many ML algorithms depend on vector and matrix operations.
| Concept | Meaning | ML Use |
|---|---|---|
| Vector | A list of numbers | One data record |
| Matrix | A rectangular table of numbers | Dataset with rows and columns |
| Dot Product | Combines two vectors into a number | Neural networks and similarity |
| Eigenvalues / Eigenvectors | Important directions in transformations | PCA and dimensionality reduction |
| SVD | Matrix factorization | Compression and recommendation systems |
import numpy as np
student_vector = np.array([85, 76, 4])
dataset_matrix = np.array([[85, 76, 4], [70, 88, 3], [90, 92, 5]])
print(student_vector)
print(dataset_matrix)
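
The dot product from the table can be sketched with the same student vector and dataset matrix. The `weights` vector is a hypothetical set of feature weights chosen only for illustration; it is not a trained model.

```python
import numpy as np

student_vector = np.array([85, 76, 4])
dataset_matrix = np.array([[85, 76, 4], [70, 88, 3], [90, 92, 5]])

# Hypothetical weights for the three features (illustration only)
weights = np.array([0.5, 0.4, 2.0])

# Dot product: combines one student's features into a single number
score = student_vector @ weights

# Matrix-vector product: the same dot product applied to every row at once
all_scores = dataset_matrix @ weights

print("Matrix shape (rows, columns):", dataset_matrix.shape)
print("Score for one student:", score)
print("Scores for all students:", all_scores)
```

Computing all rows in one matrix operation, rather than looping over students, is the pattern most ML libraries rely on.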
5.13 Distance and Similarity Measures
| Measure | Meaning | Use |
|---|---|---|
| Euclidean Distance | Straight-line distance | KNN, clustering |
| Manhattan Distance | Sum of absolute differences | Grid-like distance |
| Cosine Similarity | Angle similarity | Text and recommendations |
| Jaccard Similarity | Intersection divided by union | Set and text similarity |
| Orthogonality | Vectors at right angles | Independent directions |
| Projection | Mapping one vector onto another | Dimensionality reduction |
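
Several of these measures take only a few lines of plain Python. The vectors and word sets below are illustrative values.

```python
import math

# Two illustrative student records: [attendance, marks]
a = [85, 76]
b = [90, 80]

# Manhattan distance: sum of absolute differences
manhattan = sum(abs(x - y) for x, y in zip(a, b))

# Cosine similarity: dot product divided by the product of the vector lengths
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cosine = dot / (norm_a * norm_b)

# Jaccard similarity: size of the intersection divided by size of the union
words_a = {"machine", "learning", "math"}
words_b = {"machine", "learning", "statistics"}
jaccard = len(words_a & words_b) / len(words_a | words_b)

# Orthogonality: two vectors are orthogonal when their dot product is zero
u, v = [1, 0], [0, 1]
is_orthogonal = sum(x * y for x, y in zip(u, v)) == 0

print("Manhattan distance:", manhattan)
print("Cosine similarity:", round(cosine, 4))
print("Jaccard similarity:", jaccard)
print("Orthogonal:", is_orthogonal)
```

Cosine similarity ignores vector length and compares direction only, which is why it suits text data where document lengths vary.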
5.14 Calculus for Machine Learning
Calculus is used to optimize Machine Learning models. During training, models adjust parameters to reduce error. Derivatives and gradients show direction and rate of change.
| Concept | Meaning | ML Use |
|---|---|---|
| Derivative | Rate of change of a function | How loss changes |
| Gradient | Vector of partial derivatives | Direction to update parameters |
| Higher-Order Derivatives | Derivatives of derivatives | Curvature analysis |
| Multivariable Calculus | Calculus with many inputs | Models with many features |
| Chain Rule | Derivative of composite functions | Backpropagation |
| Jacobian Matrix | Matrix of first-order partial derivatives | Vector-valued functions |
| Hessian Matrix | Matrix of second-order partial derivatives | Advanced optimization |
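
A gradient can be estimated numerically by nudging each input slightly and measuring how the output changes. The toy `loss` function and the helper `numerical_gradient` below are illustrative; ML libraries normally compute gradients analytically or by automatic differentiation.

```python
def loss(w1, w2):
    """Toy two-parameter loss with its minimum at (0, 0); chosen only for illustration."""
    return w1 ** 2 + 3 * w2 ** 2

def numerical_gradient(f, point, h=1e-5):
    """Central-difference estimate of each partial derivative of f at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h      # nudge input i upward
        down[i] -= h    # nudge input i downward
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad

gradient = numerical_gradient(loss, [1.0, 2.0])

# Analytic partial derivatives are 2*w1 and 6*w2, so the exact gradient at (1, 2) is [2, 12]
print("Numerical gradient at (1, 2):", [round(g, 4) for g in gradient])
```

Each entry of the gradient says how quickly the loss rises when that one parameter increases, so moving against the gradient lowers the loss.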
5.15 Gradient Descent and Stochastic Gradient Descent
Gradient descent minimizes a loss function by updating parameters step by step in the direction that reduces error.
| Method | Meaning | Use |
|---|---|---|
| Gradient Descent | Uses the full dataset | Stable but may be slow |
| Stochastic Gradient Descent | Uses one sample at a time | Faster but noisier |
| Mini-Batch Gradient Descent | Uses small batches | Common in deep learning |
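
A minimal sketch of full-batch gradient descent follows, fitting a single weight `w` in the model y = w × x by minimizing MSE. The dataset, learning rate and step count are assumed values chosen so that the loop converges quickly.

```python
# Tiny hypothetical dataset where the true relationship is y = 2 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0                # initial guess for the single parameter
learning_rate = 0.01   # step size (assumed value)
n = len(xs)

for step in range(500):
    # Gradient of MSE = (1/n) * sum((w*x - y)^2) with respect to w,
    # computed over the full dataset
    grad = (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= learning_rate * grad   # step in the direction that reduces the loss

mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n

print("Learned w:", round(w, 4))
print("Final MSE:", round(mse, 8))
```

Stochastic gradient descent would compute `grad` from one randomly chosen (x, y) pair per step, and mini-batch gradient descent from a small random subset, trading some stability for cheaper updates.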
5.16 Practical Python Examples
Mean Squared Error Example
actual = [80, 70, 90]
predicted = [78, 74, 88]
errors = []
for i in range(len(actual)):
    error = (actual[i] - predicted[i]) ** 2
    errors.append(error)
mse = sum(errors) / len(errors)
print("MSE:", mse)
Output: MSE: 8.0
Euclidean Distance Example
import math
student_a = [85, 76]
student_b = [90, 80]
distance = math.sqrt((student_a[0]-student_b[0])**2 + (student_a[1]-student_b[1])**2)
print("Distance:", distance)
Correlation Example
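import numpy as np
attendance = [60, 70, 80, 90, 95]
marks = [55, 65, 75, 88, 92]
# np.corrcoef returns a 2x2 matrix; the off-diagonal entries hold the
# correlation between attendance and marks.
correlation = np.corrcoef(attendance, marks)
print(correlation)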
5.17 Hands-On Practice Activities
Activity 1: Descriptive Statistics
Create a Python list of 10 marks and calculate mean, median, variance and standard deviation.
Activity 2: Probability
Calculate the probability of passing if 35 out of 50 students passed.
Activity 3: Distance Measure
Calculate Euclidean distance between two student records using attendance and marks.
Activity 4: MSE
Calculate Mean Squared Error between actual and predicted marks.
Mini Project: ML Math Report
Create a short analysis report using student marks to calculate descriptive statistics, correlation, MSE and distance between two students.
5.18 Interactive Final Assessment Quiz
Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.
1. Probability helps measure uncertainty in data.
2. Which measure represents the average value?
3. Standard deviation measures data spread.
4. Bayes' theorem updates probability using evidence.
5. Which test is commonly used for small sample means?
6. Vectors and matrices are part of linear algebra.
7. Euclidean distance measures straight-line distance.
8. MSE measures prediction error in regression.
9. Gradient descent is used to minimize loss functions.
10. The chain rule is important in neural network backpropagation.
5.19 Chapter Summary
In this chapter, learners studied probability, statistics, linear algebra and calculus for Machine Learning. These topics support uncertainty modelling, data interpretation, hypothesis testing, distance measurement, similarity comparison, error calculation and model optimization.