Chapter 4: Python Libraries for Data Science & ML

Use NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn to perform data analysis, visualization and beginner Machine Learning workflows.

4.1 Chapter Overview

Data Science and Machine Learning require tools that make data processing, analysis, visualization and model building easier. Python is well suited to this work because it has a rich ecosystem of libraries designed specifically for these tasks.

In this chapter, learners will explore five important Python libraries: NumPy for numerical computing, Pandas for data analysis, Matplotlib for basic visualization, Seaborn for statistical visualization, and Scikit-learn for Machine Learning.

Learning Outcome: By the end of this chapter, learners should be able to perform basic data analysis and visualization using Python libraries and understand how these libraries support Machine Learning workflows.

4.2 Learning Objectives

  • Understand the role of Python libraries in Data Science and Machine Learning.
  • Use NumPy arrays for numerical calculations.
  • Use Pandas DataFrames for tabular data analysis.
  • Create basic charts using Matplotlib.
  • Create statistical visualizations using Seaborn.
  • Understand basic Scikit-learn ML workflows.
  • Perform simple data analysis and visualization using structured steps.

4.3 Python Data Science Library Ecosystem

| Library | Main Purpose | Common Use |
|---|---|---|
| NumPy | Numerical computing | Arrays, mathematical operations, matrix calculations |
| Pandas | Data analysis | Tables, CSV files, cleaning, filtering, grouping |
| Matplotlib | Basic visualization | Line charts, bar charts, scatter plots |
| Seaborn | Statistical visualization | Correlation heatmaps, distribution plots, category plots |
| Scikit-learn | Machine Learning | Training models, splitting data, evaluation metrics |

1. NumPy: Numbers
2. Pandas: Tables
3. Matplotlib: Basic Charts
4. Seaborn: Statistical Visuals
5. Scikit-learn: ML Models

4.4 NumPy for Numerical Computing

NumPy stands for Numerical Python. It is used for fast mathematical operations and array processing. In Machine Learning, data is often converted into numerical arrays before model training.

Why NumPy is Important

  • Handles large numeric datasets efficiently.
  • Supports mathematical operations on arrays.
  • Provides the foundation for many ML libraries.
  • Useful for matrix and vector calculations.
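
The points above can be illustrated with a minimal sketch. The two-test marks table and the 40/60 weights are hypothetical values chosen only for this example; the operations themselves (element-wise arithmetic, `mean` along an axis, and the `@` matrix product) are standard NumPy.

```python
import numpy as np

# Hypothetical marks for four students in two tests (one row per student).
marks = np.array([[80, 70],
                  [75, 85],
                  [90, 95],
                  [60, 65]])

# Element-wise arithmetic: add 5 bonus marks to every entry without a loop.
bonus = marks + 5

# Column-wise mean: average score per test (axis=0 collapses the rows).
test_means = marks.mean(axis=0)

# Matrix-vector product: weighted total per student
# (assumed weights: 40% for test 1, 60% for test 2).
weights = np.array([0.4, 0.6])
weighted = marks @ weights

print(bonus)
print("Mean per test:", test_means)
print("Weighted totals:", weighted)
```

Each operation acts on the whole array at once, which is why NumPy code is usually shorter and faster than the equivalent Python loop.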

Creating a NumPy Array

import numpy as np

marks = np.array([80, 75, 90, 60])
print(marks)
Output:
[80 75 90 60]

Basic NumPy Calculations

import numpy as np

marks = np.array([80, 75, 90, 60])

print("Mean:", np.mean(marks))
print("Maximum:", np.max(marks))
print("Minimum:", np.min(marks))
print("Standard Deviation:", np.std(marks))

ML Connection: Many ML algorithms internally use arrays, vectors and matrices. NumPy helps learners understand how numerical data is represented.
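
One detail worth knowing about the calculation above: by default `np.std` computes the population standard deviation (dividing by n). Passing `ddof=1` gives the sample standard deviation (dividing by n - 1), which is what many statistics tools report. The sketch below compares the two on the same marks.

```python
import numpy as np

marks = np.array([80, 75, 90, 60])

# Default (ddof=0): population standard deviation, divides by n.
population_std = np.std(marks)

# ddof=1: sample standard deviation, divides by n - 1.
sample_std = np.std(marks, ddof=1)

print("Population std:", population_std)
print("Sample std:", sample_std)
```

The sample value is always slightly larger, and the gap shrinks as the dataset grows.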

4.5 Pandas for Data Analysis

Pandas is one of the most widely used libraries for Data Science. It works with tabular data, much like an Excel sheet, but every step is written as code, so an analysis can be repeated and automated.

| Data Structure | Explanation | Example |
|---|---|---|
| Series | One-dimensional data | One column of marks |
| DataFrame | Two-dimensional table | Student records table |
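
A short sketch of how the two structures relate: a DataFrame can be thought of as a set of Series that share one index, and selecting a single column gives back a Series. The sample values are illustrative.

```python
import pandas as pd

# A Series: one-dimensional, here a single column of marks.
marks = pd.Series([76, 88, 92], name="Marks")

# A DataFrame: two-dimensional, built from several columns.
df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Marks": marks,
})

# Selecting one column from a DataFrame returns a Series.
column = df["Marks"]

print(type(column).__name__)   # Series
print(marks.ndim, df.ndim)     # 1 2
print(df.shape)                # (3, 2)
```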

Create a DataFrame

import pandas as pd

data = {
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Attendance": [85, 70, 90],
    "Marks": [76, 88, 92]
}

df = pd.DataFrame(data)
print(df)
Output:
       Name  Attendance  Marks
0      Amin          85     76
1  Mei Ling          70     88
2      Ravi          90     92

Basic Data Analysis

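print(df.head())       # first rows of the table
df.info()              # prints column types and non-null counts itself (returns None)
print(df.describe())   # summary statistics for numeric columns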

Filtering Data

high_marks = df[df["Marks"] >= 80]
print(high_marks)
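
Filtering is often combined with sorting and grouping. The sketch below is one possible way to do this; the `Class` column and the fourth student are hypothetical additions made only so that grouping has something to group by.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi", "Siti"],
    "Class": ["A", "B", "A", "B"],   # hypothetical column for this example
    "Attendance": [85, 70, 90, 60],
    "Marks": [76, 88, 92, 55],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
strong = df[(df["Marks"] >= 80) & (df["Attendance"] >= 80)]

# Sort rows from highest to lowest marks.
ranked = df.sort_values("Marks", ascending=False)

# Group by class and compute the average marks per group.
class_means = df.groupby("Class")["Marks"].mean()

print(strong)
print(ranked)
print(class_means)
```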

4.6 Data Cleaning with Pandas

Pandas provides easy commands to clean data. Common operations include removing duplicates, filling missing values, renaming columns and cleaning text.

Handling Missing Values

import pandas as pd

data = {
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Attendance": [85, None, 90],
    "Marks": [76, 88, 92]
}

df = pd.DataFrame(data)
df["Attendance"] = df["Attendance"].fillna(df["Attendance"].mean())
print(df)

Cleaning Text

df["Name"] = df["Name"].str.strip().str.title()

Data Analysis Skill: Before visualization or ML training, data should be cleaned and checked for missing values, incorrect formats and duplicates.
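
Removing duplicates and renaming columns are mentioned in this section but not demonstrated, so here is a minimal sketch of a full cleaning pass. The messy input table (lower-case column names, a repeated row, stray spaces) is invented for illustration.

```python
import pandas as pd

# Hypothetical messy data: inconsistent names, one duplicate row, one missing value.
raw = pd.DataFrame({
    "name": ["  amin ", "Mei Ling", "Mei Ling", "ravi"],
    "attend": [85, 70, 70, None],
    "Marks": [76, 88, 88, 92],
})

# Inspect first: count missing values per column and fully duplicated rows.
print(raw.isna().sum())
print("Duplicate rows:", raw.duplicated().sum())

# Remove duplicates, give columns consistent names, and rebuild the index.
clean = (
    raw.drop_duplicates()
       .rename(columns={"name": "Name", "attend": "Attendance"})
       .reset_index(drop=True)
)

# Tidy the text and fill the missing attendance with the column mean.
clean["Name"] = clean["Name"].str.strip().str.title()
clean["Attendance"] = clean["Attendance"].fillna(clean["Attendance"].mean())

print(clean)
```

Filling with the mean is only one option; depending on the data, dropping the row or using the median may be more appropriate.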

4.7 Matplotlib for Data Visualization

Matplotlib is used to create charts and graphs. Visualization helps learners understand data patterns, trends and comparisons.

  • Identifies trends and patterns.
  • Makes data easier to explain.
  • Detects outliers and unusual values.
  • Supports better decisions.

Bar Chart Example

import matplotlib.pyplot as plt

students = ["Amin", "Mei Ling", "Ravi"]
marks = [76, 88, 92]

plt.bar(students, marks)
plt.title("Student Marks")
plt.xlabel("Students")
plt.ylabel("Marks")
plt.show()

Line Chart Example

import matplotlib.pyplot as plt

weeks = [1, 2, 3, 4]
scores = [60, 68, 75, 82]

plt.plot(weeks, scores, marker="o")   # marker="o" draws a dot at each data point
plt.title("Weekly Learning Progress")
plt.xlabel("Week")
plt.ylabel("Score")
plt.show()

4.8 Seaborn for Statistical Visualization

Seaborn is built on top of Matplotlib and provides attractive statistical visualizations. It is commonly used to explore relationships between variables.

| Chart | Purpose |
|---|---|
| histplot | Shows distribution of values |
| boxplot | Shows spread and outliers |
| scatterplot | Shows relationship between two variables |
| heatmap | Shows correlation between numeric columns |

Scatter Plot Example

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = {
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96]
}

df = pd.DataFrame(data)
sns.scatterplot(data=df, x="Attendance", y="Marks")
plt.title("Attendance vs Marks")
plt.show()

Correlation Heatmap

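# df here contains only numeric columns; exclude text columns before calling corr().
sns.heatmap(df.corr(), annot=True)
plt.title("Correlation Heatmap")
plt.show()

ML Connection: Correlation helps identify which features may be linearly related to the target variable. A strong correlation does not by itself show that one variable causes the other.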
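
The numbers behind a heatmap can also be inspected directly. The sketch below assumes a table that includes a text column (`Name`), which is excluded with `select_dtypes` before computing the default Pearson correlation.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi", "Siti", "John"],
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96],
})

# Keep only numeric columns; text columns cannot be correlated.
numeric = df.select_dtypes(include="number")

# Pairwise Pearson correlation between the numeric columns.
corr = numeric.corr()

# Single value: correlation between attendance and marks.
r = corr.loc["Attendance", "Marks"]

print(corr.round(2))
print("Attendance vs Marks:", round(r, 2))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. With only five rows, any such value should be read with caution.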

4.9 Scikit-learn for Machine Learning

Scikit-learn provides tools for data splitting, model training, prediction and evaluation.

| Tool | Purpose |
|---|---|
| train_test_split | Splits data into training and testing sets |
| LinearRegression | Builds regression models |
| DecisionTreeClassifier | Builds classification models |
| accuracy_score | Measures classification accuracy |
| mean_squared_error | Measures regression error |
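
`LinearRegression` and `mean_squared_error` appear in the table, so here is a minimal sketch of how they fit together. It assumes attendance is used to predict marks, with the same small illustrative dataset used elsewhere in this chapter.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Feature: attendance (one column). Target: marks.
X = [[85], [70], [90], [60], [95]]
y = [76, 88, 92, 55, 96]

# Fit a straight line: marks = slope * attendance + intercept.
model = LinearRegression()
model.fit(X, y)

# Predict on the same data and measure the average squared error.
predictions = model.predict(X)
mse = mean_squared_error(y, predictions)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Training MSE:", mse)
```

The error here is measured on the training data, so it is optimistic. A fair estimate requires a separate test set.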

Train-Test Split Example

from sklearn.model_selection import train_test_split

X = [[85], [70], [90], [60], [95]]
y = [1, 1, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Data:", X_train)
print("Testing Data:", X_test)

Decision Tree Example

from sklearn.tree import DecisionTreeClassifier

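# Each row is [attendance, marks]; label 1 = Pass, 0 = Fail.
X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96]]
y = [1, 1, 1, 0, 1]

model = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results
model.fit(X, y)

prediction = model.predict([[80, 75]])
print("Prediction:", prediction)
Possible Output:
Prediction: [1]

Meaning: With labels encoded as 1 = Pass and 0 = Fail, the model predicts Pass.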
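
The two examples in this section can be combined with `accuracy_score` to evaluate a model on data it has not seen. This is a sketch only: the ten rows below are hypothetical, and a dataset this small gives a very rough accuracy estimate.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows of [attendance, marks]; label 1 = Pass, 0 = Fail.
X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96],
     [50, 40], [65, 58], [88, 81], [45, 35], [78, 72]]
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train only on the training rows.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate only on the held-out rows.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Test accuracy:", accuracy)
```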

4.10 Complete Data Analysis and Visualization Workflow

The following workflow shows how the libraries work together in a Data Science and ML project.

1. Pandas: Load Data
2. Pandas: Clean Data
3. NumPy: Calculate
4. Matplotlib: Visualize
5. Seaborn: Explore
6. Scikit-learn: Model

Integrated Example

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = {
    "Name": ["Amin", "Mei Ling", "Ravi", "Siti", "John"],
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96]
}

df = pd.DataFrame(data)
print(df.describe())

average_marks = np.mean(df["Marks"])
print("Average Marks:", average_marks)

plt.bar(df["Name"], df["Marks"])
plt.title("Student Marks")
plt.show()

sns.scatterplot(data=df, x="Attendance", y="Marks")
plt.title("Attendance vs Marks")
plt.show()
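
The integrated example builds its DataFrame in code, whereas the first workflow step is loading data. A minimal sketch of that step follows. To keep it self-contained, the CSV content is held in a string and read through `io.StringIO`; with a real file you would pass its path instead (for example a hypothetical `students.csv`).

```python
import io

import pandas as pd

# CSV content held in a string so the example runs without a file on disk.
csv_text = """Name,Attendance,Marks
Amin,85,76
Mei Ling,70,88
Ravi,90,92
Siti,60,55
John,95,96
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Quick inspection before any analysis.
print(df.head())
print(df.dtypes)
print("Average Marks:", df["Marks"].mean())
```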

4.11 Choosing the Right Library

| Task | Recommended Library | Reason |
|---|---|---|
| Calculate mean or standard deviation | NumPy | Fast numerical operations |
| Read CSV and clean table data | Pandas | Excellent DataFrame support |
| Create a simple bar chart | Matplotlib | Flexible basic visualization |
| Create statistical plots | Seaborn | Attractive and analysis-friendly charts |
| Train a Machine Learning model | Scikit-learn | Ready-made ML algorithms and evaluation tools |

4.12 Common Beginner Mistakes

| Mistake | Problem | Correction |
|---|---|---|
| Skipping data inspection | Errors and missing values may go unnoticed. | Use df.head(), df.info() and df.describe(). |
| Visualizing dirty data | Charts may be misleading. | Clean data before visualization. |
| Using the wrong chart type | Data story becomes unclear. | Choose chart based on analysis purpose. |
| Training model before preprocessing | Model performance may be poor. | Clean, encode and scale data first. |
| Not splitting train and test data | Cannot measure model performance fairly. | Use train_test_split. |
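
The "encode and scale" correction in the table can be sketched as follows, assuming one common approach: `pd.get_dummies` for one-hot encoding and Scikit-learn's `StandardScaler` for scaling. The `Class` column is a hypothetical categorical feature added for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Class": ["A", "B", "A", "B"],   # hypothetical categorical feature
    "Attendance": [85, 70, 90, 60],
    "Marks": [76, 88, 92, 55],
})

numeric_cols = ["Attendance", "Marks"]

# Scale: transform each numeric column to mean 0 and standard deviation 1.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[numeric_cols])
scaled_df = pd.DataFrame(scaled, columns=numeric_cols, index=df.index)

# Encode: turn the text column into one indicator column per category.
encoded = pd.get_dummies(df["Class"], prefix="Class")

# Combine the scaled numbers and the encoded categories.
prepared = pd.concat([scaled_df, encoded], axis=1)

print(prepared)
```

In a real project the scaler would normally be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.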

4.13 Hands-On Practice Activities

Activity 1: NumPy Statistics

Create a NumPy array of student marks and calculate mean, maximum, minimum and standard deviation.

Activity 2: Pandas DataFrame

Create a DataFrame with student name, attendance and marks. Display the first records and summary statistics.

Activity 3: Matplotlib Chart

Create a bar chart showing student marks.

Activity 4: Seaborn Scatter Plot

Create a scatter plot showing the relationship between attendance and marks.

Mini Project: Student Performance Analysis

Create a small dataset of attendance and marks. Analyze it with Pandas, visualize it with Matplotlib or Seaborn, and split it using Scikit-learn.

4.14 Interactive Final Assessment Quiz

Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.

Instructions: Select the correct answer for each question.

1. Which library is mainly used for numerical computing?

2. Which library is commonly used for DataFrames?

3. Which library is used to create basic charts in Python?

4. Seaborn is useful for statistical visualization.

5. Scikit-learn is used for Machine Learning workflows.

6. Which function is commonly used to split data into training and testing sets?

7. Pandas can read and analyze tabular data.

8. Matplotlib and Seaborn can support data visualization.

9. Data should be cleaned before visualization and model training.

10. Which library is commonly used to train ML models?

4.15 Chapter Summary

In this chapter, learners studied key Python libraries for Data Science and Machine Learning. They learned how NumPy supports numerical computing, Pandas supports data analysis, Matplotlib and Seaborn support visualization, and Scikit-learn supports Machine Learning workflows.

Remember: A successful Data Science and ML workflow combines data cleaning, analysis, visualization and model preparation. These libraries work together to help learners perform data analysis and visualization effectively.