Chapter 5: Statistics & Mathematics for Machine Learning

Learn probability, statistics, linear algebra and calculus concepts required to understand Machine Learning algorithms, model evaluation and optimization.

Probability: Uncertainty
Statistics: Patterns
Linear Algebra: Data Shape
Calculus: Optimization

5.1 Chapter Overview

Machine Learning is built on mathematics. Probability helps measure uncertainty, statistics helps summarize and interpret data, linear algebra helps represent data using vectors and matrices, and calculus helps optimize models during training.

This chapter explains the mathematical foundations of ML in a beginner-friendly way. The aim is not just to memorize formulas, but to understand why each concept matters in real ML workflows.

Learning Outcome: Learners should be able to understand uncertainty, interpret data patterns, compare variables, measure model error, understand data representation and explain optimization logic.

5.2 Learning Objectives

  • Explain probability concepts used in Machine Learning.
  • Understand descriptive and inferential statistics.
  • Apply formulas such as mean, variance, probability and MSE.
  • Understand correlation, covariance, confidence intervals and p-values.
  • Recognize common statistical tests used in data analysis.
  • Understand vectors, matrices, distance measures and similarity measures.
  • Explain gradients, derivatives and gradient descent in model training.

5.3 Probability for Machine Learning

Probability measures how likely an event is to happen. In ML, probability is used when models make predictions under uncertainty, such as a student having an 85% chance of passing.

P(A) = Favourable Outcomes / Total Outcomes (when all outcomes are equally likely)

Concept | Meaning | Example
Sample Space | All possible outcomes. | Head, Tail
Event | A specific outcome or set of outcomes. | Getting Head
Independent Events | One event does not affect another. | Two coin tosses
Dependent Events | One event affects another. | Drawing cards without replacement
Mutually Exclusive Events | Events cannot happen together. | Pass and Fail

Probability Rules

Rule | Formula | Use
Complement | P(not A) = 1 - P(A) | Event does not happen
Addition | P(A or B) = P(A) + P(B) - P(A and B) | Either event happens
Multiplication | P(A and B) = P(A) × P(B) | Both independent events happen
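
The three rules can be checked on a small example. The sketch below assumes one roll of a fair six-sided die and two tosses of a fair coin; the die events and the helper name `prob` are illustrative choices, not part of the chapter.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}             # event A: the roll is even
greater_than_3 = {4, 5, 6}   # event B: the roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


# Complement rule: P(not A) = 1 - P(A)
p_not_even = 1 - prob(even)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_even_or_gt3 = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)

# Multiplication rule for independent events: two fair coin tosses both give Head
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

print("P(not even):", p_not_even)
print("P(even or > 3):", p_even_or_gt3)
print("P(two heads):", p_two_heads)
```

Exact fractions are used here so the results are not blurred by floating-point rounding. The addition rule subtracts P(A and B) because the outcomes 4 and 6 belong to both events and would otherwise be counted twice.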

5.4 Joint, Conditional and Marginal Probability

Type | Meaning | Example
Joint Probability | Two events happen together. | High attendance and pass
Conditional Probability | Probability of A given B happened. | Pass given high attendance
Marginal Probability | Probability of one event regardless of others. | Overall pass probability

P(A | B) = P(A ∩ B) / P(B)

If 40 students have high attendance and 35 of them pass, then P(Pass | High Attendance) = 35/40 = 0.875, or 87.5%.
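
To see how joint, marginal and conditional probability relate, the counts can be placed in a small script. Only the 40 and 35 come from the example above; the class size of 100 and the 30 low-attendance passes are assumptions added purely for illustration.

```python
from fractions import Fraction

total_students = 100    # assumed class size (not given in the text)
high_attendance = 40    # students with high attendance (from the text)
high_and_pass = 35      # high-attendance students who passed (from the text)
low_and_pass = 30       # assumed: low-attendance students who passed

# Joint probability: high attendance AND pass
p_joint = Fraction(high_and_pass, total_students)

# Marginal probabilities: one event regardless of the other
p_high = Fraction(high_attendance, total_students)
p_pass = Fraction(high_and_pass + low_and_pass, total_students)

# Conditional probability: P(Pass | High) = P(Pass and High) / P(High)
p_pass_given_high = p_joint / p_high

print("Joint P(Pass and High):", float(p_joint))
print("Marginal P(High):", float(p_high))
print("Marginal P(Pass):", float(p_pass))
print("Conditional P(Pass | High):", float(p_pass_given_high))
```

The conditional probability does not depend on the assumed class size, because the total cancels when the joint probability is divided by the marginal.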

5.5 Bayes' Theorem

Bayes' theorem updates probability when new evidence is available. It is used in spam filtering, medical diagnosis, risk prediction and Naive Bayes classifiers.

P(A | B) = [P(B | A) × P(A)] / P(B)

Term | Meaning
P(A) | Prior probability
P(B|A) | Likelihood
P(B) | Total probability of evidence
P(A|B) | Updated probability
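
A short sketch of the formula applied to spam filtering is given below. All three input probabilities and the word "offer" are hypothetical values chosen for illustration; a real filter would estimate them from labelled emails.

```python
# Hypothetical spam-filter numbers, chosen only for illustration.
p_spam = 0.20              # prior P(A): share of emails that are spam
p_word_given_spam = 0.60   # likelihood P(B|A): "offer" appears in a spam email
p_word_given_ham = 0.05    # P(B|not A): "offer" appears in a non-spam email

# Total probability of the evidence, P(B), summed over spam and non-spam
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Updated probability P(A|B) from Bayes' theorem
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(word):", round(p_word, 3))
print("P(spam | word):", round(p_spam_given_word, 3))
```

Under these assumed numbers, seeing the word raises the probability of spam well above the 20% prior, which is the "update" that Bayes' theorem describes.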

5.6 Probability Distributions

A probability distribution describes how values are spread across possible outcomes. ML uses distributions to understand data behavior and uncertainty.

Distribution | Use | Example
Normal | Bell-shaped continuous data | Exam scores
Binomial | Success/failure outcomes | Pass or fail
Poisson | Counts of events | Website visits per minute
t-Distribution | Small sample inference | Small class performance

Sampling distributions show how a statistic such as a sample mean varies across repeated samples.
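
A sampling distribution can be approximated by simulation. The sketch below assumes a made-up population of 10,000 exam scores spread uniformly between 40 and 100, then draws many samples of 30 scores and records each sample mean; the population, sample size and seed are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: 10,000 exam scores drawn uniformly between 40 and 100.
population = [random.uniform(40, 100) for _ in range(10_000)]

# Draw 1,000 samples of 30 scores each and record every sample mean.
sample_means = [
    statistics.mean(random.sample(population, 30)) for _ in range(1_000)
]

print("Population mean:", round(statistics.mean(population), 2))
print("Mean of sample means:", round(statistics.mean(sample_means), 2))
print("Population std dev:", round(statistics.pstdev(population), 2))
print("Std dev of sample means:", round(statistics.stdev(sample_means), 2))
```

The sample means should cluster around the population mean and vary much less than individual scores do, even though the population itself is not bell-shaped.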

5.7 Statistics for Machine Learning

Statistics helps summarize, analyze and interpret data. It is essential for understanding data patterns, evaluating models and making data-driven decisions.

Descriptive Statistics

Measure | Meaning | ML Use
Mean | Average value | Average marks or sales
Median | Middle value | Useful with outliers
Mode | Most frequent value | Most common category
Variance | Average squared spread | Measures variability
Standard Deviation | Typical distance from mean | Shows spread

Variance = Σ(x - mean)² / n (population form; sample variance divides by n - 1)
Standard Deviation = √Variance

import numpy as np

marks = [60, 70, 80, 90, 100]

print("Mean:", np.mean(marks))      # average value
print("Median:", np.median(marks))  # middle value
# np.var and np.std divide by n by default, matching the formula above
print("Variance:", np.var(marks))
print("Standard Deviation:", np.std(marks))
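
The variance formula above divides by n, which treats the data as a whole population. When the data is a sample from a larger group, dividing by n - 1 is the usual choice. The sketch below uses Python's built-in statistics module on the same marks list to show both versions side by side.

```python
import statistics

marks = [60, 70, 80, 90, 100]

# Population versions divide the summed squared deviations by n
population_variance = statistics.pvariance(marks)
population_std = statistics.pstdev(marks)

# Sample versions divide by n - 1
sample_variance = statistics.variance(marks)
sample_std = statistics.stdev(marks)

print("Population variance:", population_variance)
print("Sample variance:", sample_variance)
print("Population std dev:", round(population_std, 3))
print("Sample std dev:", round(sample_std, 3))
```

In NumPy, passing ddof=1 to np.var or np.std gives the sample version instead of the default population version.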

5.8 Inferential Statistics

Concept | Meaning | Example
Confidence Interval | Range likely to contain the true population value | Average score between 70 and 78
Central Limit Theorem | Sample means tend toward a normal distribution as sample size grows | Repeated class samples
P-Value | Probability of a result at least as extreme as the one observed, assuming the null hypothesis is true | Significance testing
Hypothesis Testing | Tests assumptions using evidence | Does a new teaching method improve results?

Simple Meaning: Inferential statistics helps decide whether a pattern is meaningful or due to random chance.
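
As a rough illustration of a confidence interval, the sketch below builds a 95% interval for a mean using the normal approximation. The 40 scores are simulated, and the mean of 74 and spread of 8 used to generate them are assumptions; for small samples a t-distribution would usually be preferred over the normal value used here.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

# Simulated sample of 40 exam scores (illustrative data only).
scores = [random.gauss(74, 8) for _ in range(40)]

n = len(scores)
mean = statistics.mean(scores)

# Standard error: sample standard deviation divided by the square root of n
std_err = statistics.stdev(scores) / math.sqrt(n)

# z value for a 95% interval (about 1.96) from the standard normal distribution
z = NormalDist().inv_cdf(0.975)

lower = mean - z * std_err
upper = mean + z * std_err
print(f"Sample mean: {mean:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

A larger sample shrinks the standard error, so the interval narrows as more data is collected.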

5.9 Correlation, Covariance, Skewness and Kurtosis

Concept | Meaning | ML Use
Covariance | Whether two variables move together | Attendance and marks
Correlation | Standardized relationship from -1 to +1 | Feature relationship analysis
Skewness | Asymmetry of distribution | Detects lopsided feature distributions
Kurtosis | Tail heaviness | Detects extreme values

Correlation near +1 = strong positive relationship
Correlation near -1 = strong negative relationship
Correlation near 0 = weak or no linear relationship
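
The link between covariance and correlation can be made concrete by computing both by hand. The sketch below reuses the attendance and marks values from the correlation example in section 5.16; the function names are illustrative, and the sample (n - 1) form of covariance is an assumed choice.

```python
import math

attendance = [60, 70, 80, 90, 95]
marks = [55, 65, 75, 88, 92]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance: summed product of deviations divided by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


print("Covariance:", covariance(attendance, marks))
print("Correlation:", round(correlation(attendance, marks), 4))
```

Covariance depends on the units of the two variables, while correlation divides those units out, which is why it always falls between -1 and +1.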

5.10 Hypothesis Testing and Parametric Tests

Hypothesis testing checks whether a claim about data is supported by evidence.

Test | Purpose | Example
Z-Test | Compare sample mean with population mean when the sample is large and the population standard deviation is known | Large exam dataset
T-Test | Compare means when the sample is small or the population standard deviation is unknown | Compare two small classes
F-Test | Compare variances | Variation between groups
Chi-Square Test | Test the relationship between categorical variables | Course type and pass/fail

If p-value < 0.05, the result is commonly considered statistically significant.
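
A one-sample Z-test can be sketched with the standard library alone. Every number below (the claimed mean of 70, the known standard deviation of 10, the sample mean of 73 and the sample size of 50) is hypothetical and chosen only to show the steps.

```python
import math
from statistics import NormalDist

# Hypothetical numbers, for illustration only.
population_mean = 70   # mean score claimed under the null hypothesis
population_std = 10    # assumed known population standard deviation
sample_mean = 73       # mean of the observed sample
n = 50                 # sample size

# z statistic: how many standard errors the sample mean is from the claimed mean
z = (sample_mean - population_mean) / (population_std / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant at the 0.05 level")
else:
    print("Not statistically significant at the 0.05 level")
```

The 0.05 threshold is a convention rather than a law; a small p-value says the observed result would be unusual if the null hypothesis were true, not that the effect is large or important.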

5.11 Bias-Variance Tradeoff, MLE and MSE

Concept | Meaning | ML Importance
Bias | Error from overly simple assumptions | High bias can cause underfitting
Variance | Error from sensitivity to training data | High variance can cause overfitting
Bias-Variance Tradeoff | Balancing simplicity and flexibility | Improves generalization
Maximum Likelihood Estimation | Finds parameters that make observed data most likely | Used in statistical models
Mean Squared Error | Average squared prediction error | Regression model error

MSE = (1/n) Σ(yᵢ - ŷᵢ)²
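
Maximum Likelihood Estimation can be illustrated with pass/fail data. The sketch below assumes ten hypothetical outcomes and searches a grid of candidate pass probabilities for the one that makes the observed data most likely; the grid search is a teaching device, not how statistical libraries usually fit models.

```python
import math

# Hypothetical data: 1 = pass, 0 = fail for ten students.
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]


def log_likelihood(p, data):
    """Log-likelihood of pass/fail data if each student passes with probability p."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)


# Try candidate values 0.01, 0.02, ..., 0.99 and keep the most likely one.
candidates = [i / 100 for i in range(1, 100)]
p_mle = max(candidates, key=lambda p: log_likelihood(p, outcomes))

print("Grid-search MLE:", p_mle)
print("Sample proportion of passes:", sum(outcomes) / len(outcomes))
```

For this kind of data the maximum likelihood estimate coincides with the observed proportion of passes, which the grid search recovers.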

5.12 Linear Algebra for Machine Learning

Linear algebra represents and manipulates data using vectors and matrices. Many ML algorithms depend on vector and matrix operations.

Concept | Meaning | ML Use
Vector | A list of numbers | One data record
Matrix | A rectangular table of numbers | Dataset with rows and columns
Dot Product | Combines two vectors into a number | Neural networks and similarity
Eigenvalues / Eigenvectors | Important directions in transformations | PCA and dimensionality reduction
SVD | Matrix factorization | Compression and recommendation systems

import numpy as np

# One student's record as a vector of three feature values
student_vector = np.array([85, 76, 4])

# Three student records stacked as a matrix: one row per student, one column per feature
dataset_matrix = np.array([[85, 76, 4], [70, 88, 3], [90, 92, 5]])

print(student_vector)
print(dataset_matrix)
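
The dot product is the operation that turns a record and a set of weights into a single number. The sketch below uses plain Python lists with the same values as the arrays above; the weights are hypothetical and stand in for whatever a model might learn.

```python
student_vector = [85, 76, 4]
dataset_matrix = [[85, 76, 4], [70, 88, 3], [90, 92, 5]]
weights = [0.5, 0.25, 2.0]   # hypothetical feature weights


def dot(u, v):
    """Dot product: multiply matching entries and add the results."""
    return sum(a * b for a, b in zip(u, v))


# One record combined with the weights gives one number
score = dot(student_vector, weights)

# Applying the same weights to every row is a matrix-vector product
all_scores = [dot(row, weights) for row in dataset_matrix]

print("Score for one student:", score)
print("Scores for all students:", all_scores)
```

With NumPy arrays, `dataset_matrix @ weights` computes the same matrix-vector product in one step.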

5.13 Distance and Similarity Measures

Measure | Meaning | Use
Euclidean Distance | Straight-line distance | KNN, clustering
Manhattan Distance | Sum of absolute differences | Grid-like distance
Cosine Similarity | Angle similarity | Text and recommendations
Jaccard Similarity | Intersection divided by union | Set and text similarity
Orthogonality | Vectors at right angles | Independent directions
Projection | Mapping one vector onto another | Dimensionality reduction

Euclidean Distance = √(Σ(xᵢ - yᵢ)²)
Manhattan Distance = Σ|xᵢ - yᵢ|
Cosine Similarity = (A · B) / (||A|| ||B||)
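
The formulas above translate almost directly into code. The sketch below is a plain-Python version with illustrative function names; the two student records and the two keyword sets are made-up inputs.

```python
import math


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute differences along each feature."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


student_a = [85, 76]
student_b = [90, 80]

print("Euclidean:", round(euclidean(student_a, student_b), 3))
print("Manhattan:", manhattan(student_a, student_b))
print("Cosine:", round(cosine_similarity(student_a, student_b), 4))
print("Jaccard:", jaccard({"ml", "math", "python"}, {"ml", "python", "stats"}))
print("Orthogonal vectors:", cosine_similarity([1, 0], [0, 1]))
```

Orthogonal vectors have a dot product of zero, so their cosine similarity is zero as well. Cosine similarity ignores vector length and looks only at direction, which is why it is popular for text, where document length varies widely.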

5.14 Calculus for Machine Learning

Calculus is used to optimize Machine Learning models. During training, models adjust parameters to reduce error. Derivatives and gradients show direction and rate of change.

Concept | Meaning | ML Use
Derivative | Rate of change of a function | How loss changes
Gradient | Vector of partial derivatives | Direction to update parameters
Higher-Order Derivatives | Derivatives of derivatives | Curvature analysis
Multivariable Calculus | Calculus with many inputs | Models with many features
Chain Rule | Derivative of composite functions | Backpropagation
Jacobian Matrix | Matrix of first-order partial derivatives | Vector-valued functions
Hessian Matrix | Matrix of second-order partial derivatives | Advanced optimization
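
Derivatives and gradients can be estimated numerically, which is a handy way to build intuition. The sketch below uses two hypothetical loss functions and a central-difference approximation; the function names and the step size h are illustrative assumptions.

```python
def loss(w):
    """Hypothetical one-parameter loss with its minimum at w = 3."""
    return (w - 3) ** 2


def bowl(p):
    """Hypothetical two-parameter loss: x squared plus three times y squared."""
    return p[0] ** 2 + 3 * p[1] ** 2


def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def numerical_gradient(f, point, h=1e-5):
    """Estimate each partial derivative of f at the given point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h      # nudge only the i-th input upward
        down[i] -= h    # nudge only the i-th input downward
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


# Exact derivative of (w - 3)^2 at w = 5 is 2 * (5 - 3) = 4
print("Derivative at w = 5:", numerical_derivative(loss, 5.0))

# Exact gradient of x^2 + 3y^2 at (1, 2) is [2x, 6y] = [2, 12]
print("Gradient at (1, 2):", numerical_gradient(bowl, [1.0, 2.0]))
```

The printed values should be very close to the exact ones noted in the comments, with small differences from floating-point rounding. A positive derivative means the loss rises as the parameter increases, so reducing the loss means moving the parameter the other way.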

5.15 Gradient Descent and Stochastic Gradient Descent

Gradient descent minimizes a loss function by updating parameters step by step in the direction that reduces error.

New Parameter = Old Parameter - Learning Rate × Gradient

Method | Meaning | Use
Gradient Descent | Uses the full dataset | Stable but may be slow
Stochastic Gradient Descent | Uses one sample at a time | Faster but noisier
Mini-Batch Gradient Descent | Uses small batches | Common in deep learning

Simple Meaning: Gradient descent is like walking downhill step by step until reaching the lowest error point.
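
The update rule can be tried on a tiny problem. The sketch below assumes four hypothetical data points that follow y = 2x and fits a single weight w, first with full-batch gradient descent and then with a stochastic version that updates after each sample. The learning rate, step counts and seed are illustrative choices.

```python
import random

# Hypothetical data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
learning_rate = 0.01   # assumed step size


def mse_gradient(w, xs, ys):
    """Gradient of MSE = (1/n) * sum((w*x - y)^2) with respect to w."""
    n = len(xs)
    return (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))


# Full-batch gradient descent: every step uses the whole dataset
w = 0.0
for step in range(200):
    # New Parameter = Old Parameter - Learning Rate × Gradient
    w = w - learning_rate * mse_gradient(w, xs, ys)

# Stochastic gradient descent: update after each single sample
random.seed(0)
w_sgd = 0.0
for epoch in range(50):
    pairs = list(zip(xs, ys))
    random.shuffle(pairs)  # visit the samples in a random order each epoch
    for x, y in pairs:
        w_sgd = w_sgd - learning_rate * 2 * (w_sgd * x - y) * x

print("Gradient descent w:", round(w, 4))
print("Stochastic gradient descent w:", round(w_sgd, 4))
```

Both versions should end close to w = 2 on this noise-free data. On real data, the stochastic version typically takes a less steady path because each update sees only one sample.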

5.16 Practical Python Examples

Mean Squared Error Example

actual = [80, 70, 90]
predicted = [78, 74, 88]

# Squared difference between each actual and predicted value
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# MSE is the average of the squared errors
mse = sum(squared_errors) / len(squared_errors)
print("MSE:", mse)

Output:
MSE: 8.0

Euclidean Distance Example

import math

student_a = [85, 76]
student_b = [90, 80]

# Square root of the summed squared differences across both features
distance = math.sqrt((student_a[0] - student_b[0]) ** 2 + (student_a[1] - student_b[1]) ** 2)
print("Distance:", distance)

Correlation Example

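import numpy as np

attendance = [60, 70, 80, 90, 95]
marks = [55, 65, 75, 88, 92]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entries
# hold the correlation between attendance and marks
correlation = np.corrcoef(attendance, marks)
print(correlation)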

5.17 Hands-On Practice Activities

Activity 1: Descriptive Statistics

Create a Python list of 10 marks and calculate mean, median, variance and standard deviation.

Activity 2: Probability

Calculate the probability of passing if 35 out of 50 students passed.

Activity 3: Distance Measure

Calculate Euclidean distance between two student records using attendance and marks.

Activity 4: MSE

Calculate Mean Squared Error between actual and predicted marks.

Mini Project: ML Math Report

Create a short analysis report using student marks to calculate descriptive statistics, correlation, MSE and distance between two students.

5.18 Interactive Final Assessment Quiz

Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.

1. Probability helps measure uncertainty in data.

2. Which measure represents the average value?

3. Standard deviation measures data spread.

4. Bayes' theorem updates probability using evidence.

5. Which test is commonly used for small sample means?

6. Vectors and matrices are part of linear algebra.

7. Euclidean distance measures straight-line distance.

8. MSE measures prediction error in regression.

9. Gradient descent is used to minimize loss functions.

10. The chain rule is important in neural network backpropagation.

5.19 Chapter Summary

In this chapter, learners studied probability, statistics, linear algebra and calculus for Machine Learning. These topics support uncertainty modelling, data interpretation, hypothesis testing, distance measurement, similarity comparison, error calculation and model optimization.

Remember: Machine Learning is not only coding. Strong mathematical understanding helps learners interpret models, improve accuracy and make better data-driven decisions.