Chapter 4: Python Libraries for Data Science & ML
Use NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn to perform data analysis, visualization and beginner Machine Learning workflows.
4.1 Chapter Overview
Data Science and Machine Learning depend on tools that simplify data processing, analysis, visualization and model building. Python is widely used for this work because its ecosystem includes mature libraries designed specifically for these tasks.
In this chapter, learners will explore five important Python libraries: NumPy for numerical computing, Pandas for data analysis, Matplotlib for basic visualization, Seaborn for statistical visualization, and Scikit-learn for Machine Learning.
4.2 Learning Objectives
- Understand the role of Python libraries in Data Science and Machine Learning.
- Use NumPy arrays for numerical calculations.
- Use Pandas DataFrames for tabular data analysis.
- Create basic charts using Matplotlib.
- Create statistical visualizations using Seaborn.
- Understand basic Scikit-learn ML workflows.
- Perform simple data analysis and visualization using structured steps.
4.3 Python Data Science Library Ecosystem
| Library | Main Purpose | Common Use |
|---|---|---|
| NumPy | Numerical computing | Arrays, mathematical operations, matrix calculations |
| Pandas | Data analysis | Tables, CSV files, cleaning, filtering, grouping |
| Matplotlib | Basic visualization | Line charts, bar charts, scatter plots |
| Seaborn | Statistical visualization | Correlation heatmaps, distribution plots, category plots |
| Scikit-learn | Machine Learning | Training models, splitting data, evaluation metrics |
In short: NumPy handles numbers, Pandas handles tables, Matplotlib draws basic charts, Seaborn draws statistical visuals, and Scikit-learn builds ML models.
4.4 NumPy for Numerical Computing
NumPy stands for Numerical Python. It is used for fast mathematical operations and array processing. In Machine Learning, data is often converted into numerical arrays before model training.
Why NumPy is Important
- Handles large numeric datasets efficiently.
- Supports mathematical operations on arrays.
- Provides the foundation for many ML libraries.
- Useful for matrix and vector calculations.
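The array, vector and matrix operations listed above can be sketched as follows. This is a minimal illustration, not a complete tour of NumPy; the bonus value and the weights are made up for the example.

```python
import numpy as np

marks = np.array([80, 75, 90, 60])

# Element-wise arithmetic: the operation applies to every value at once,
# with no explicit Python loop.
bonus_marks = marks + 5                      # hypothetical 5-mark bonus
percent_of_top = marks / marks.max() * 100   # each mark relative to the best

# Vector calculation: a weighted total via the dot product.
weights = np.array([0.1, 0.2, 0.3, 0.4])     # hypothetical weights, sum to 1
weighted_total = np.dot(marks, weights)

# Matrix calculation: a 2x2 matrix multiplied by a vector.
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([10, 20])
product = matrix @ vector

print(bonus_marks)
print(weighted_total)
print(product)
```

The `@` operator and `np.dot` both perform matrix or vector multiplication, while `+`, `/` and `*` work element by element.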
Creating a NumPy Array
import numpy as np
marks = np.array([80, 75, 90, 60])
print(marks)
Output:
[80 75 90 60]
Basic NumPy Calculations
import numpy as np
marks = np.array([80, 75, 90, 60])
print("Mean:", np.mean(marks))
print("Maximum:", np.max(marks))
print("Minimum:", np.min(marks))
print("Standard Deviation:", np.std(marks))
4.5 Pandas for Data Analysis
Pandas is one of the most widely used libraries in Data Science. It works with tabular data much like a spreadsheet such as Excel, but lets you load, filter, clean and summarize that data in code.
| Data Structure | Explanation | Example |
|---|---|---|
| Series | One-dimensional data | One column of marks |
| DataFrame | Two-dimensional table | Student records table |
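To make the difference between the two structures concrete, here is a small sketch that builds one of each from the chapter's sample marks. The variable names are illustrative only.

```python
import pandas as pd

# A Series: one labelled column of values.
marks = pd.Series([76, 88, 92], name="Marks")

# A DataFrame: several columns sharing the same row index.
records = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Marks": [76, 88, 92],
})

# Selecting one column of a DataFrame gives back a Series.
marks_column = records["Marks"]

print(type(marks_column).__name__)
print(marks.ndim, records.ndim)   # number of dimensions of each structure
print(records.shape)              # (rows, columns)
```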
Create a DataFrame
import pandas as pd
data = {
"Name": ["Amin", "Mei Ling", "Ravi"],
"Attendance": [85, 70, 90],
"Marks": [76, 88, 92]
}
df = pd.DataFrame(data)
print(df)
Output:
       Name  Attendance  Marks
0      Amin          85     76
1  Mei Ling          70     88
2      Ravi          90     92
Basic Data Analysis
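print(df.head())      # first five rows
df.info()             # prints column types and non-null counts by itself
print(df.describe())  # summary statistics for numeric columns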
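One detail that often surprises beginners: the standard deviation reported by describe() is the sample standard deviation (ddof=1), while np.std() computes the population standard deviation (ddof=0) by default. The sketch below compares the two on the chapter's sample marks.

```python
import numpy as np
import pandas as pd

marks = pd.Series([76, 88, 92], name="Marks")

population_std = np.std(marks.to_numpy())   # NumPy default: ddof=0
sample_std = marks.std()                    # Pandas default: ddof=1
described_std = marks.describe()["std"]     # same value as marks.std()

# Passing ddof=1 makes NumPy agree with Pandas.
numpy_sample_std = np.std(marks.to_numpy(), ddof=1)

print("Population std:", population_std)
print("Sample std:", sample_std)
```

Neither value is wrong; they answer slightly different questions, so it helps to state which one a report uses.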
Filtering Data
high_marks = df[df["Marks"] >= 80]
print(high_marks)
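Filters can also combine several conditions. The following sketch, which assumes the same three-student DataFrame, shows the AND and OR forms and a combined row-and-column selection with .loc.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amin", "Mei Ling", "Ravi"],
    "Attendance": [85, 70, 90],
    "Marks": [76, 88, 92],
})

# Each condition needs its own parentheses; & means AND, | means OR.
high_both = df[(df["Marks"] >= 80) & (df["Attendance"] >= 80)]
high_either = df[(df["Marks"] >= 80) | (df["Attendance"] >= 80)]

# .loc selects rows by condition and a column by name in one step.
names_high_marks = df.loc[df["Marks"] >= 80, "Name"]

print(high_both)
print(names_high_marks.tolist())
```

Python's plain `and` / `or` keywords do not work on whole columns, which is why Pandas uses `&` and `|` here.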
4.6 Data Cleaning with Pandas
Pandas provides easy commands to clean data. Common operations include removing duplicates, filling missing values, renaming columns and cleaning text.
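As a minimal sketch of two of these operations, the example below removes a duplicated row and renames columns. The messy column names and the repeated row are invented for illustration.

```python
import pandas as pd

# Hypothetical messy data: one repeated row and inconsistent column names.
raw = pd.DataFrame({
    "student name": ["Amin", "Mei Ling", "Amin", "Ravi"],
    "ATTENDANCE": [85, 70, 85, 90],
    "marks": [76, 88, 76, 92],
})

cleaned = (
    raw.drop_duplicates()                 # remove the repeated row
       .rename(columns={
           "student name": "Name",
           "ATTENDANCE": "Attendance",
           "marks": "Marks",
       })
       .reset_index(drop=True)            # renumber rows 0, 1, 2
)

print(cleaned)
```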
Handling Missing Values
import pandas as pd
data = {
"Name": ["Amin", "Mei Ling", "Ravi"],
"Attendance": [85, None, 90],
"Marks": [76, 88, 92]
}
df = pd.DataFrame(data)
df["Attendance"] = df["Attendance"].fillna(df["Attendance"].mean())
print(df)
Cleaning Text
df["Name"] = df["Name"].str.strip().str.title()
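The effect of this line is easier to see on untidy input. The sketch below applies the same call to a few deliberately messy names (hypothetical values).

```python
import pandas as pd

# Hypothetical untidy names: stray spaces and inconsistent capitalization.
df = pd.DataFrame({"Name": ["  amin", "MEI LING ", "ravi  "]})

# strip() removes leading and trailing whitespace; title() capitalizes each word.
df["Name"] = df["Name"].str.strip().str.title()

print(df["Name"].tolist())
# → ['Amin', 'Mei Ling', 'Ravi']
```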
4.7 Matplotlib for Data Visualization
Matplotlib is used to create charts and graphs. Visualizing data helps learners in several ways:
- Identifies trends and patterns.
- Makes data easier to explain.
- Detects outliers and unusual values.
- Supports better decisions.
Bar Chart Example
import matplotlib.pyplot as plt
students = ["Amin", "Mei Ling", "Ravi"]
marks = [76, 88, 92]
plt.bar(students, marks)
plt.title("Student Marks")
plt.xlabel("Students")
plt.ylabel("Marks")
plt.show()
Line Chart Example
import matplotlib.pyplot as plt
weeks = [1, 2, 3, 4]
scores = [60, 68, 75, 82]
plt.plot(weeks, scores, marker="o")
plt.title("Weekly Learning Progress")
plt.xlabel("Week")
plt.ylabel("Score")
plt.show()
4.8 Seaborn for Statistical Visualization
Seaborn is built on top of Matplotlib and provides attractive statistical visualizations. It is commonly used to explore relationships between variables.
| Chart | Purpose |
|---|---|
| histplot | Shows distribution of values |
| boxplot | Shows spread and outliers |
| scatterplot | Shows relationship between two variables |
| heatmap | Shows correlation between numeric columns |
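A heatmap is a colored view of a correlation matrix, so it helps to look at the numbers themselves first. This sketch computes the matrix with Pandas alone, using the chapter's sample attendance and marks.

```python
import pandas as pd

df = pd.DataFrame({
    "Attendance": [85, 70, 90, 60, 95],
    "Marks": [76, 88, 92, 55, 96],
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = df.corr()

# Values range from -1 to 1; each column correlates perfectly with itself.
print(corr_matrix.round(2))
```

If a DataFrame also contains text columns such as names, select the numeric columns before calling corr().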
Scatter Plot Example
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
data = {
"Attendance": [85, 70, 90, 60, 95],
"Marks": [76, 88, 92, 55, 96]
}
df = pd.DataFrame(data)
sns.scatterplot(data=df, x="Attendance", y="Marks")
plt.title("Attendance vs Marks")
plt.show()
Correlation Heatmap
# Reuses sns, plt and df from the scatter plot example above.
sns.heatmap(df.corr(), annot=True)
plt.title("Correlation Heatmap")
plt.show()
4.9 Scikit-learn for Machine Learning
Scikit-learn provides tools for data splitting, model training, prediction and evaluation.
| Tool | Purpose |
|---|---|
| train_test_split | Splits data into training and testing sets |
| LinearRegression | Builds regression models |
| DecisionTreeClassifier | Builds classification models |
| accuracy_score | Measures classification accuracy |
| mean_squared_error | Measures regression error |
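The regression tools in the table can be sketched on the chapter's sample data as follows. Treating attendance as the feature and marks as the target is an assumption made for this illustration, and five rows are far too few for a meaningful model.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Attendance (feature) and marks (target) from the chapter's sample data.
X = [[85], [70], [90], [60], [95]]
y = [76, 88, 92, 55, 96]

model = LinearRegression()
model.fit(X, y)

# Predict on the same rows, only to show the API; a real evaluation
# would use a held-out test set.
predicted = model.predict(X)
mse = mean_squared_error(y, predicted)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("MSE on training data:", mse)
```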
Train-Test Split Example
from sklearn.model_selection import train_test_split
X = [[85], [70], [90], [60], [95]]  # feature: attendance
y = [1, 1, 1, 0, 1]                 # label: 1 = Pass (0 taken to mean Fail)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("Training Data:", X_train)
print("Testing Data:", X_test)
Decision Tree Example
from sklearn.tree import DecisionTreeClassifier
X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96]]  # [attendance, marks]
y = [1, 1, 1, 0, 1]                                      # 1 = Pass (0 taken to mean Fail)
model = DecisionTreeClassifier()
model.fit(X, y)
prediction = model.predict([[80, 75]])
print("Prediction:", prediction)
Output:
Prediction: [1]
Meaning: The label 1 stands for Pass, so the model predicts that a student with 80 attendance and 75 marks will pass.
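The two examples above can be combined into one small workflow: split the data, train on the training rows, then measure accuracy on the unseen rows. The ten records below are hypothetical, and the score from such a tiny test set should not be read as a reliable estimate.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical records: [attendance, marks]; label 1 = Pass, 0 = Fail.
X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96],
     [50, 40], [65, 58], [80, 70], [55, 45], [75, 82]]
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

# stratify=y keeps the Pass/Fail proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)        # learn from the training rows only

y_pred = model.predict(X_test)     # predict the unseen rows
accuracy = accuracy_score(y_test, y_pred)

print("Test accuracy:", accuracy)
```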
4.10 Complete Data Analysis and Visualization Workflow
The following workflow shows how the libraries work together in a Data Science and ML project.
Load Data → Clean Data → Calculate → Visualize → Explore → Model
Integrated Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = {
"Name": ["Amin", "Mei Ling", "Ravi", "Siti", "John"],
"Attendance": [85, 70, 90, 60, 95],
"Marks": [76, 88, 92, 55, 96]
}
df = pd.DataFrame(data)
print(df.describe())
average_marks = np.mean(df["Marks"])
print("Average Marks:", average_marks)
plt.bar(df["Name"], df["Marks"])
plt.title("Student Marks")
plt.show()
sns.scatterplot(data=df, x="Attendance", y="Marks")
plt.title("Attendance vs Marks")
plt.show()
4.11 Choosing the Right Library
| Task | Recommended Library | Reason |
|---|---|---|
| Calculate mean or standard deviation | NumPy | Fast numerical operations |
| Read CSV and clean table data | Pandas | Excellent DataFrame support |
| Create a simple bar chart | Matplotlib | Flexible basic visualization |
| Create statistical plots | Seaborn | Attractive and analysis-friendly charts |
| Train a Machine Learning model | Scikit-learn | Ready-made ML algorithms and evaluation tools |
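Reading a CSV file is mentioned in the table but not shown elsewhere in the chapter. The sketch below uses inline text in place of a file so that it runs anywhere; the file name in the comment is hypothetical.

```python
import io

import pandas as pd

# Inline CSV text stands in for a file; with a real file you would pass
# its path instead, e.g. pd.read_csv("students.csv") (hypothetical name).
csv_text = """Name,Attendance,Marks
Amin,85,76
Mei Ling,70,88
Ravi,90,92
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.dtypes)   # numeric columns are detected automatically
```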
4.12 Common Beginner Mistakes
| Mistake | Problem | Correction |
|---|---|---|
| Skipping data inspection | Errors and missing values may go unnoticed. | Use df.head(), df.info() and df.describe(). |
| Visualizing dirty data | Charts may be misleading. | Clean data before visualization. |
| Using the wrong chart type | Data story becomes unclear. | Choose chart based on analysis purpose. |
| Training model before preprocessing | Model performance may be poor. | Clean, encode and scale data first. |
| Not splitting train and test data | Cannot measure model performance fairly. | Use train_test_split. |
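The last two rows of the table fit together: split first, then learn any preprocessing from the training rows only. A minimal sketch with StandardScaler follows, using hypothetical records; whether scaling is needed at all depends on the model chosen.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical records: [attendance, marks]; label 1 = Pass, 0 = Fail.
X = [[85, 76], [70, 88], [90, 92], [60, 55], [95, 96],
     [50, 40], [65, 58], [80, 70], [55, 45], [75, 82]]
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

# Split first so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)         # reuse the same mean/std

print("Training column means after scaling:", X_train_scaled.mean(axis=0))
```

Calling fit_transform on the full dataset before splitting would let information from the test rows leak into training.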
4.13 Hands-On Practice Activities
Activity 1: NumPy Statistics
Create a NumPy array of student marks and calculate mean, maximum, minimum and standard deviation.
Activity 2: Pandas DataFrame
Create a DataFrame with student name, attendance and marks. Display the first records and summary statistics.
Activity 3: Matplotlib Chart
Create a bar chart showing student marks.
Activity 4: Seaborn Scatter Plot
Create a scatter plot showing the relationship between attendance and marks.
Mini Project: Student Performance Analysis
Create a small dataset of attendance and marks. Analyze it with Pandas, visualize it with Matplotlib or Seaborn, and split it using Scikit-learn.
4.14 Interactive Final Assessment Quiz
Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.
1. Which library is mainly used for numerical computing?
2. Which library is commonly used for DataFrames?
3. Which library is used to create basic charts in Python?
4. True or False: Seaborn is useful for statistical visualization.
5. True or False: Scikit-learn is used for Machine Learning workflows.
6. Which function is commonly used to split data into training and testing sets?
7. True or False: Pandas can read and analyze tabular data.
8. True or False: Matplotlib and Seaborn can support data visualization.
9. True or False: Data should be cleaned before visualization and model training.
10. Which library is commonly used to train ML models?
4.15 Chapter Summary
In this chapter, learners studied key Python libraries for Data Science and Machine Learning. They learned how NumPy supports numerical computing, Pandas supports data analysis, Matplotlib and Seaborn support visualization, and Scikit-learn supports Machine Learning workflows.