Chapter 4 - Python Libraries for Data Science & ML

4.1 Chapter Overview

Data Science and Machine Learning require tools that simplify data processing, analysis, visualization and model building. Python is well suited to this work because it has a rich ecosystem of libraries designed specifically for these tasks.

In this chapter, learners will explore five important Python libraries: NumPy for numerical computing, Pandas for data analysis, Matplotlib for basic visualization, Seaborn for statistical visualization, and Scikit-learn for Machine Learning.

Learning Outcome: By the end of this chapter, learners should be able to perform basic data analysis and visualization using Python libraries and understand how these libraries support Machine Learning workflows.

4.2 Learning Objectives

Understand the role of Python libraries in Data Science and Machine Learning.
Use NumPy arrays for numerical calculations.
Use Pandas DataFrames for tabular data analysis.
Create basic charts using Matplotlib.
Create statistical visualizations using Seaborn.
Understand basic Scikit-learn ML workflows.
Perform simple data analysis and visualization using structured steps.

4.3 Python Data Science Library Ecosystem

Library	Main Purpose	Common Use
NumPy	Numerical computing	Arrays, mathematical operations, matrix calculations
Pandas	Data analysis	Tables, CSV files, cleaning, filtering, grouping
Matplotlib	Basic visualization	Line charts, bar charts, scatter plots
Seaborn	Statistical visualization	Correlation heatmaps, distribution plots, category plots
Scikit-learn	Machine Learning	Training models, splitting data, evaluation metrics

1. NumPy: Numbers
2. Pandas: Tables
3. Matplotlib: Basic Charts
4. Seaborn: Statistical Visuals
5. Scikit-learn: ML Models

4.4 NumPy for Numerical Computing

NumPy stands for Numerical Python. It is used for fast mathematical operations and array processing. In Machine Learning, data is often converted into numerical arrays before model training.

Why NumPy is Important

Handles large numeric datasets efficiently.
Supports mathematical operations on arrays.
Provides the foundation for many ML libraries.
Useful for matrix and vector calculations.
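The points above can be illustrated with a short sketch. The two test arrays and the 40/60 weighting are hypothetical values chosen for this example; they are not part of the chapter's dataset.

```python
import numpy as np

# Hypothetical marks for four students in two tests
test1 = np.array([80, 75, 90, 60])
test2 = np.array([70, 85, 95, 50])

# Element-wise arithmetic: no explicit loop is needed
total = test1 + test2
average = (test1 + test2) / 2

# Matrix-vector product: weight test 1 at 40% and test 2 at 60% (assumed weights)
scores = np.column_stack([test1, test2])   # shape (4, 2): one row per student
weights = np.array([0.4, 0.6])
weighted = scores @ weights                # shape (4,): one weighted mark per student

print(total)
print(average)
print(weighted)
```

Each operation applies to the whole array at once, which is the style of computation most ML libraries rely on internally.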

Creating a NumPy Array

import numpy as np

marks = np.array([80, 75, 90, 60])
print(marks)

Output:
[80 75 90 60]

Basic NumPy Calculations

import numpy as np

marks = np.array([80, 75, 90, 60])

print("Mean:", np.mean(marks))
print("Maximum:", np.max(marks))
print("Minimum:", np.min(marks))
print("Standard Deviation:", np.std(marks))
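One detail worth knowing about the standard deviation above: np.std divides by n by default (the population formula), while Pandas divides by n - 1 (the sample formula), so df.describe() used later in this chapter can report a different value for the same numbers. A minimal sketch of the difference, using the same marks array:

```python
import numpy as np
import pandas as pd

marks = np.array([80, 75, 90, 60])

population_std = np.std(marks)         # divides by n (ddof=0, NumPy's default)
sample_std = np.std(marks, ddof=1)     # divides by n - 1
pandas_std = pd.Series(marks).std()    # Pandas defaults to ddof=1

print("Population std:", round(population_std, 2))   # about 10.83
print("Sample std:", round(sample_std, 2))           # 12.5
print("Pandas std:", round(pandas_std, 2))           # 12.5
```

Neither value is wrong; they answer slightly different questions, so it helps to state which one is being reported.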

ML Connection: Many ML algorithms internally use arrays, vectors and matrices. NumPy helps learners understand how numerical data is represented.

4.5 Pandas for Data Analysis

Pandas is one of the most widely used libraries for Data Science. It works with tabular data, much like an Excel sheet, but every operation is written as code, so an analysis can be repeated and automated.

Data Structure	Explanation	Example
Series	One-dimensional data	One column of marks
DataFrame	Two-dimensional table	Student records table
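The relationship between the two structures can be sketched as follows: selecting one column of a DataFrame returns a Series, while selecting a list of columns returns a smaller DataFrame. The variable names here are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Marks": [76, 88, 92],
})

marks = df["Marks"]               # single brackets: one column as a Series
subset = df[["Name", "Marks"]]    # double brackets: a smaller DataFrame

print(type(marks).__name__)       # Series
print(type(subset).__name__)      # DataFrame
print(marks.shape, subset.shape)  # (3,) (3, 2)
```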

Create a DataFrame

import pandas as pd

data = {
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Attendance": [85, 70, 90],
    "Marks": [76, 88, 92]
}

df = pd.DataFrame(data)
print(df)

Output:
       Name  Attendance  Marks
0      Amin          85     76
1  Mei Ling          70     88
2      Ravi          90     92

Basic Data Analysis

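print(df.head())      # first five rows
df.info()             # prints column types and non-null counts itself (it returns None)
print(df.describe())  # summary statistics for the numeric columns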

Filtering Data

high_marks = df[df["Marks"] >= 80]
print(high_marks)
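Filters can also combine several conditions. The sketch below rebuilds the same three-student DataFrame so that it runs on its own; the 80-point thresholds and the names strong and at_risk are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Attendance": [85, 70, 90],
    "Marks": [76, 88, 92],
})

# Each condition needs its own parentheses; use & for "and", | for "or"
strong = df[(df["Marks"] >= 80) & (df["Attendance"] >= 80)]
at_risk = df[(df["Marks"] < 80) | (df["Attendance"] < 80)]

print(strong["Name"].tolist())    # ['Ravi']
print(at_risk["Name"].tolist())   # ['Amin', 'Mei Ling']
```

Python's plain and / or keywords do not work here, because the comparison produces a whole column of True/False values rather than a single value.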

4.6 Data Cleaning with Pandas

Pandas provides easy commands to clean data. Common operations include removing duplicates, filling missing values, renaming columns and cleaning text.

Handling Missing Values

import pandas as pd

data = {
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Attendance": [85, None, 90],
    "Marks": [76, 88, 92]
}

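df = pd.DataFrame(data)  # None becomes NaN, so Attendance is stored as float
# Replace the missing value with the column mean: (85 + 90) / 2 = 87.5
df["Attendance"] = df["Attendance"].fillna(df["Attendance"].mean())
print(df)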

Cleaning Text

df["Name"] = df["Name"].str.strip().str.title()

Data Analysis Skill: Before visualization or ML training, data should be cleaned and checked for missing values, incorrect formats and duplicates.
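The checks described above might look like the following sketch. The raw table is hypothetical: it contains one duplicated row, one missing value and an abbreviated column name ("Att") so that each check has something to find.

```python
import pandas as pd

# Hypothetical raw data with a duplicated row, a missing value and an unclear column name
raw = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Mei Ling", "Ravi"],
    "Att": [85, 70, 70, None],
    "Marks": [76, 88, 88, 92],
})

print(raw.isna().sum())          # missing values per column
print(raw.duplicated().sum())    # number of fully duplicated rows
print(raw.dtypes)                # column types, to spot incorrect formats

clean = (
    raw.drop_duplicates()                        # remove repeated rows
       .rename(columns={"Att": "Attendance"})    # give the column a clear name
       .reset_index(drop=True)                   # renumber rows 0, 1, 2, ...
)
print(clean)
```

The missing attendance value is still present in clean; it can be filled with fillna as in the Handling Missing Values example.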

4.7 Matplotlib for Data Visualization

Matplotlib is used to create charts and graphs. Visualization helps learners understand data patterns, trends and comparisons. A well-chosen chart:

Identifies trends and patterns.
Makes data easier to explain.
Detects outliers and unusual values.
Supports better decisions.

Bar Chart Example

import matplotlib.pyplot as plt

students = ["Amin", "Mei Ling", "Ravi"]
marks = [76, 88, 92]

plt.bar(students, marks)
plt.title("Student Marks")
plt.xlabel("Students")
plt.ylabel("Marks")
plt.show()

Line Chart Example

import matplotlib.pyplot as plt

weeks = [1, 2, 3, 4]
scores = [60, 68, 75, 82]

plt.plot(weeks, scores, marker="o")
plt.title("Weekly Learning Progress")
plt.xlabel("Week")
plt.ylabel("Score")
plt.show()
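Matplotlib can also draw scatter plots and place several charts in one figure. The sketch below uses plt.subplots, which returns a figure and its axes; the five-student data is borrowed from later examples in this chapter, and the variable names are illustrative.

```python
import matplotlib.pyplot as plt

names = ["Amin", "Mei Ling", "Ravi", "Siti", "John"]
attendance = [85, 70, 90, 60, 95]
marks = [76, 88, 92, 55, 96]

# One figure with two axes side by side
fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

ax_scatter.scatter(attendance, marks)
ax_scatter.set_title("Attendance vs Marks")
ax_scatter.set_xlabel("Attendance")
ax_scatter.set_ylabel("Marks")

ax_bar.bar(names, marks)
ax_bar.set_title("Student Marks")
ax_bar.set_ylabel("Marks")

fig.tight_layout()   # prevent labels from overlapping
plt.show()
```

Working with explicit axes objects (ax.set_title instead of plt.title) makes it clear which chart each command applies to once a figure holds more than one.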

4.8 Seaborn for Statistical Visualization

Seaborn is built on top of Matplotlib and provides attractive statistical visualizations. It is commonly used to explore relationships between variables.

Chart	Purpose
histplot	Shows distribution of values
boxplot	Shows spread and outliers
scatterplot	Shows relationship between two variables
heatmap	Shows correlation between numeric columns

Scatter Plot Example

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = {
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96]
}

df = pd.DataFrame(data)
sns.scatterplot(data=df, x="Attendance", y="Marks")
plt.title("Attendance vs Marks")
plt.show()

Correlation Heatmap

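# Reuses the df from the scatter plot example; numeric_only skips any text columns
sns.heatmap(df.corr(numeric_only=True), annot=True)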
plt.title("Correlation Heatmap")
plt.show()

ML Connection: Correlation helps identify which features may be related to the target variable.
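The number behind the heatmap can also be computed directly. This sketch uses the same attendance and marks values and Pandas' default Pearson correlation, which measures only linear relationships.

```python
import pandas as pd

df = pd.DataFrame({
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96],
})

# Pearson correlation between one feature and the target
r = df["Attendance"].corr(df["Marks"])
print("Correlation:", round(r, 2))   # about 0.79

# Correlation of every numeric column with the target, strongest first
print(df.corr(numeric_only=True)["Marks"].sort_values(ascending=False))
```

A value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests a weak one. With only five rows the result is illustrative, and correlation alone does not show that attendance causes higher marks.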

4.9 Scikit-learn for Machine Learning

Scikit-learn provides tools for data splitting, model training, prediction and evaluation.

Tool	Purpose
train_test_split	Splits data into training and testing sets
LinearRegression	Builds regression models
DecisionTreeClassifier	Builds classification models
accuracy_score	Measures classification accuracy
mean_squared_error	Measures regression error
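As a sketch of the two regression tools in the table, the example below fits LinearRegression to hypothetical data that lies exactly on the line marks = 10 * hours + 40, so the fitted slope and intercept can be checked by eye. The "hours studied" feature is invented for this example.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data: hours studied -> marks, lying exactly on marks = 10 * hours + 40
X = [[1], [2], [3], [4], [5]]   # one feature per row (2-D, as scikit-learn expects)
y = [50, 60, 70, 80, 90]

model = LinearRegression()
model.fit(X, y)

predictions = model.predict(X)
mse = mean_squared_error(y, predictions)

print("Slope:", model.coef_[0])        # close to 10
print("Intercept:", model.intercept_)  # close to 40
print("MSE:", mse)                     # close to 0, because the data is perfectly linear
```

For brevity the error is measured on the training data here. In a real project it would be measured on a separate test set.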

Train-Test Split Example

from sklearn.model_selection import train_test_split

X = [[85], [70], [90], [60], [95]]   # feature: attendance, one row per student
y = [1, 1, 1, 0, 1]                  # label: 1 = Pass, 0 = Fail

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Data:", X_train)
print("Testing Data:", X_test)

Decision Tree Example

from sklearn.tree import DecisionTreeClassifier

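X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96]]  # [attendance, marks] per student
y = [1, 1, 1, 0, 1]                                     # 1 = Pass, 0 = Fail

model = DecisionTreeClassifier()
model.fit(X, y)

prediction = model.predict([[80, 75]])  # new student: attendance 80, marks 75
print("Prediction:", prediction)

Possible Output:
Prediction: [1]

Meaning: The model predicts class 1, which stands for Pass in this example.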
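The two examples above can be combined into one workflow: split, train on the training set only, then measure accuracy on the test set. The ten-student dataset below is hypothetical, and stratify=y is an optional setting used here to keep a similar Pass/Fail mix in both sets.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: [attendance, marks] per student; label 1 = Pass, 0 = Fail
X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96],
     [50, 40], [65, 58], [80, 72], [55, 45], [75, 81]]
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)          # learn only from the training set

y_pred = model.predict(X_test)       # predict on data the model has not seen
print("Accuracy:", accuracy_score(y_test, y_pred))
```

With so few rows the accuracy figure is not reliable. The point of the sketch is the order of the steps.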

4.10 Complete Data Analysis and Visualization Workflow

The following workflow shows how the libraries work together in a Data Science and ML project.

1. Pandas: Load Data
2. Pandas: Clean Data
3. NumPy: Calculate
4. Matplotlib: Visualize
5. Seaborn: Explore
6. Scikit-learn: Model

Integrated Example

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = {
    "Name": ["Amin", "Mei Ling", "Ravi", "Siti", "John"],
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96]
}

df = pd.DataFrame(data)
print(df.describe())

average_marks = np.mean(df["Marks"])
print("Average Marks:", average_marks)

plt.bar(df["Name"], df["Marks"])
plt.title("Student Marks")
plt.show()

sns.scatterplot(data=df, x="Attendance", y="Marks")
plt.title("Attendance vs Marks")
plt.show()
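The integrated example stops before the modelling step of the workflow. One way to add it is sketched below, under two assumptions that are not in the chapter: a pass mark of 60 and the use of attendance as the only feature.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi", "Siti", "John"],
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96],
})

# Assumed pass mark of 60; the chapter does not define one
df["Passed"] = np.where(df["Marks"] >= 60, 1, 0)

X = df[["Attendance"]]   # features must be 2-D: a DataFrame, not a Series
y = df["Passed"]

model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

# Hypothetical new student with 80% attendance
new_student = pd.DataFrame({"Attendance": [80]})
print("Prediction:", model.predict(new_student))
```

On a real dataset the data would be split with train_test_split before fitting, so that the model can be evaluated on rows it has not seen.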

4.11 Choosing the Right Library

Task	Recommended Library	Reason
Calculate mean or standard deviation	NumPy	Fast numerical operations
Read CSV and clean table data	Pandas	Excellent DataFrame support
Create a simple bar chart	Matplotlib	Flexible basic visualization
Create statistical plots	Seaborn	Attractive and analysis-friendly charts
Train a Machine Learning model	Scikit-learn	Ready-made ML algorithms and evaluation tools

4.12 Common Beginner Mistakes

Mistake	Problem	Correction
Skipping data inspection	Errors and missing values may go unnoticed.	Use df.head(), df.info() and df.describe().
Visualizing dirty data	Charts may be misleading.	Clean data before visualization.
Using the wrong chart type	Data story becomes unclear.	Choose chart based on analysis purpose.
Training model before preprocessing	Model performance may be poor.	Clean, encode and scale data first.
Not splitting train and test data	Cannot measure model performance fairly.	Use train_test_split.
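The "encode and scale" correction in the table can be sketched as follows. The Mode column and its values are hypothetical, added only so that there is a text column to encode.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one text (categorical) column
df = pd.DataFrame({
    "Attendance": [85, 70, 90, 60],
    "Marks": [76, 88, 92, 55],
    "Mode": ["Online", "Campus", "Online", "Campus"],
})

# Encode: turn the text column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["Mode"], dtype=int)

# Scale: give the numeric columns mean 0 and standard deviation 1
scaler = StandardScaler()
encoded[["Attendance", "Marks"]] = scaler.fit_transform(encoded[["Attendance", "Marks"]])

print(encoded)
```

In a full workflow the scaler would normally be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.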

4.13 Hands-On Practice Activities

Activity 1: NumPy Statistics

Create a NumPy array of student marks and calculate mean, maximum, minimum and standard deviation.

Activity 2: Pandas DataFrame

Create a DataFrame with student name, attendance and marks. Display the first records and summary statistics.

Activity 3: Matplotlib Chart

Create a bar chart showing student marks.

Activity 4: Seaborn Scatter Plot

Create a scatter plot showing the relationship between attendance and marks.

Mini Project: Student Performance Analysis

Create a small dataset of attendance and marks. Analyze it with Pandas, visualize it with Matplotlib or Seaborn, and split it using Scikit-learn.

4.14 Interactive Final Assessment Quiz

Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.

Instructions: Select the correct answer for each question.

1. Which library is mainly used for numerical computing?

A. NumPy   B. Photoshop   C. Word   D. Paint

2. Which library is commonly used for DataFrames?

A. Pandas   B. Matplotlib   C. Seaborn   D. HTML

3. Which library is used to create basic charts in Python?

A. Matplotlib   B. Notepad   C. Excel only   D. CSS

4. Seaborn is useful for statistical visualization.

A. True   B. False

5. Scikit-learn is used for Machine Learning workflows.

A. True   B. False

6. Which function is commonly used to split data into training and testing sets?

A. train_test_split   B. print_split   C. chart_split   D. html_split

7. Pandas can read and analyze tabular data.

A. True   B. False

8. Matplotlib and Seaborn can support data visualization.

A. True   B. False

9. Data should be cleaned before visualization and model training.

A. True   B. False

10. Which library is commonly used to train ML models?

A. Scikit-learn   B. Paint   C. Calculator   D. Browser history


4.15 Chapter Summary

In this chapter, learners studied key Python libraries for Data Science and Machine Learning. They learned how NumPy supports numerical computing, Pandas supports data analysis, Matplotlib and Seaborn support visualization, and Scikit-learn supports Machine Learning workflows.

Remember: A successful Data Science and ML workflow combines data cleaning, analysis, visualization and model preparation. These libraries work together to help learners perform data analysis and visualization effectively.