Chapter 3 - Data Collection & Preprocessing

3.1 Chapter Overview

Machine Learning models depend heavily on data quality. A model trained with poor, incomplete, inconsistent or biased data will usually produce poor predictions. This is why data collection and preprocessing are among the most important stages in any AI or ML project.

Data preprocessing means preparing raw data before it is used for model training. This may include removing duplicates, handling missing values, correcting formats, converting text categories into numbers, scaling values, selecting features and splitting data into training and testing sets.

Key Learning Focus:
Good Machine Learning begins with good data. Clean, structured, high-quality datasets improve model accuracy, reliability and usefulness.

3.2 Learning Objectives

Understand the importance of data collection in Machine Learning projects.
Identify different sources of data for AI and ML development.
Recognize common data quality problems.
Clean missing, duplicate and inconsistent data.
Transform raw data into structured datasets.
Encode categorical data into machine-readable format.
Scale numeric features for better model performance.
Prepare structured, high-quality datasets for Machine Learning workflows.

Learning Outcome

By the end of this chapter, learners will be able to develop structured, high-quality datasets for Machine Learning by applying data collection, cleaning, transformation and preprocessing techniques.

3.3 What is Data Collection?

Data collection is the process of gathering information that will be used for analysis, prediction or model training. In Machine Learning, data is the foundation because the model learns patterns from examples.

Common Data Sources

Data Source	Example	Use in ML
CSV / Excel Files	Student marks, sales records, attendance sheets	Common source for beginner ML projects
Databases	Customer records, inventory systems	Used in business and enterprise applications
APIs	Weather, finance, social media data	Used for live or updated data
Sensors / IoT	Temperature, machine vibration, traffic data	Used in smart manufacturing and automation
Web Forms	Survey responses, registration forms	Used for collecting user information

Example:
For a student performance prediction model, useful data may include attendance percentage, assignment marks, quiz scores, study hours and final exam results.
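
As a minimal sketch of collecting data from a CSV source, the example below reads student records with Python's built-in csv module. The column names and sample rows are assumptions made for illustration, and the text is held in memory here so the example runs without a file on disk.

```python
import csv
import io

# Hypothetical CSV content. In a real project this would come from a file,
# for example: open("students.csv", newline="")
csv_text = """name,attendance,quiz_score,study_hours,final_result
Amin,85,72,10,Pass
Mei Ling,92,88,14,Pass
Ravi,60,45,4,Fail
"""

# DictReader uses the first row as column names and
# returns one dictionary per data row.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

print(len(records))
print(records[0])
```

Note that the csv module reads every value as text, so "85" is still a string at this point. Converting such values into numbers is part of preprocessing.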

3.4 Types of Data in ML Datasets

Before preprocessing data, it is important to understand what type of data is being collected. Different data types require different cleaning and transformation methods.

Data Type	Description	Example
Numeric Data	Numbers used for calculation.	Age, marks, price, attendance percentage
Categorical Data	Data grouped into categories.	Gender, course type, city, result
Text Data	Free-form written content.	Feedback, comments, review text
Date / Time Data	Time-based information.	Registration date, login time
Boolean Data	True or False values.	Payment completed, certificate issued
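
The sketch below shows one possible way to represent each data type in Python once a raw record has been read as text. The field names, the values and the date format are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical raw record where every value arrived as text.
raw = {
    "age": "21",
    "course_type": "AI",
    "feedback": "Great class",
    "registration_date": "2024-03-15",
    "payment_completed": "true",
}

# Parse the date using an assumed YYYY-MM-DD format.
registration = datetime.strptime(raw["registration_date"], "%Y-%m-%d").date()

typed = {
    "age": int(raw["age"]),                     # numeric data
    "course_type": raw["course_type"],          # categorical data
    "feedback": raw["feedback"],                # text data
    "registration_date": registration,          # date / time data
    "payment_completed": raw["payment_completed"].lower() == "true",  # boolean data
}

for field, value in typed.items():
    print(field, type(value).__name__, value)
```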

3.5 What is Data Quality?

Data quality describes how suitable, accurate and reliable the data is for analysis and model training. Poor data quality can cause incorrect predictions and misleading conclusions.

Important Data Quality Dimensions

Accuracy

Data should represent the correct value. Example: marks should be entered correctly.

Completeness

Important values should not be missing. Example: attendance should not be blank.

Consistency

Data should follow the same format. Example: course names should be written consistently.

Relevance

Data should be useful for the ML problem. Example: shoe size may not help predict exam results.

Common Data Problems

Problem	Example	Impact
Missing Values	Attendance is blank	Model may not learn correctly
Duplicate Records	Same student appears twice	Results may become biased
Incorrect Format	Marks stored as text	Calculation may fail
Inconsistent Labels	Pass, pass, PASSED	Categories become confusing
Outliers	Marks = 999	Can distort analysis
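
The problems in the table above can be detected with a few simple checks. The following is a rough sketch in plain Python; the sample records and the assumed valid range for marks (0 to 100) are illustrative only.

```python
# Hypothetical records containing the problems listed in the table.
records = [
    {"name": "Amin", "attendance": 85, "marks": 76, "result": "Pass"},
    {"name": "Mei Ling", "attendance": None, "marks": 88, "result": "PASS"},
    {"name": "Amin", "attendance": 85, "marks": 76, "result": "Pass"},
    {"name": "Ravi", "attendance": 90, "marks": 999, "result": "pass"},
]

# Missing values: attendance is blank (None).
missing = [r["name"] for r in records if r["attendance"] is None]

# Duplicate records: the same combination of values appears more than once.
seen = set()
duplicates = []
for r in records:
    key = (r["name"], r["attendance"], r["marks"], r["result"])
    if key in seen:
        duplicates.append(r["name"])
    seen.add(key)

# Outliers: marks outside the assumed valid range of 0 to 100.
out_of_range = [r["name"] for r in records if not 0 <= r["marks"] <= 100]

# Inconsistent labels: list every distinct spelling of the result.
labels = sorted({r["result"] for r in records})

print("Missing attendance:", missing)
print("Duplicate records:", duplicates)
print("Marks out of range:", out_of_range)
print("Result labels found:", labels)
```

Output:
Missing attendance: ['Mei Ling']
Duplicate records: ['Amin']
Marks out of range: ['Ravi']
Result labels found: ['PASS', 'Pass', 'pass']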

3.6 Data Cleaning

Data cleaning is the process of fixing problems in raw data. It may include correcting spelling, removing duplicates, handling missing values, converting data types and standardizing formats.

Example Raw Student Data

Name	Attendance	Marks	Result
amin	85	76	pass
Mei Ling		88	PASS
Ravi	90	999	Pass

This dataset has inconsistently formatted names (extra spaces and lowercase letters), a missing attendance value, inconsistent result labels and an unrealistic mark value. Before using it for ML, these issues must be fixed.

Python Example: Cleaning Text Values

name = "  amin  "
result = "pass"

clean_name = name.strip().title()
clean_result = result.strip().title()

print(clean_name)
print(clean_result)

Output:
Amin
Pass
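
Using strip() and title() is not enough when the same label is spelled in different ways, such as Pass, pass and PASSED. One possible approach, sketched below, is to map every known variant to a single standard label. The mapping and the "Unknown" fallback are assumptions for illustration.

```python
# Assumed mapping from known variants (in lowercase) to one standard label.
label_map = {
    "pass": "Pass",
    "passed": "Pass",
    "fail": "Fail",
    "failed": "Fail",
}

raw_results = ["Pass", " pass", "PASSED", "fail ", "absent"]

clean_results = []
for value in raw_results:
    key = value.strip().lower()
    # Labels that are not recognised are flagged instead of guessed.
    clean_results.append(label_map.get(key, "Unknown"))

print(clean_results)
```

Output:
['Pass', 'Pass', 'Pass', 'Fail', 'Unknown']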

3.7 Handling Missing Values

Missing values occur when data is not available or not entered. Missing values must be handled carefully because Machine Learning models usually cannot train properly with blank values.

Common Ways to Handle Missing Values

Method	Explanation	Example
Remove Records	Delete rows with missing values.	Remove student record with no marks
Fill with Mean	Use average value for missing numeric data.	Fill missing attendance with average attendance
Fill with Mode	Use most common value for missing category data.	Fill missing result with most common result
Use Placeholder	Use a fixed value when data is unknown.	Fill missing city with "Unknown"

Python Example: Fill Missing Attendance

attendance = [85, None, 90, 75]

valid_attendance = [value for value in attendance if value is not None]
average_attendance = sum(valid_attendance) / len(valid_attendance)

cleaned_attendance = []

for value in attendance:
    if value is None:
        cleaned_attendance.append(average_attendance)
    else:
        cleaned_attendance.append(value)

print(cleaned_attendance)

Output:
[85, 83.33333333333333, 90, 75]

Learning Note:
Filling missing values is also called imputation.
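
The mode and placeholder methods from the table can be sketched in a similar way. The sample values below, including the city names, are hypothetical.

```python
from collections import Counter

results = ["Pass", None, "Fail", "Pass"]
cities = ["Kuala Lumpur", None, "Penang", None]

# Fill with mode: find the most common known result.
known_results = [value for value in results if value is not None]
most_common_result = Counter(known_results).most_common(1)[0][0]

filled_results = [most_common_result if value is None else value for value in results]

# Use placeholder: mark unknown cities with a fixed value.
filled_cities = ["Unknown" if value is None else value for value in cities]

print(filled_results)
print(filled_cities)
```

Output:
['Pass', 'Pass', 'Fail', 'Pass']
['Kuala Lumpur', 'Unknown', 'Penang', 'Unknown']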

3.8 Removing Duplicate Data

Duplicate records can create bias in Machine Learning. If the same record appears many times, the model may give that pattern too much importance.

Python Example: Remove Duplicate Student Names

students = ["Amin", "Mei Ling", "Amin", "Ravi", "Ravi"]

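# dict.fromkeys() keeps only the first occurrence of each name, in order.
unique_students = list(dict.fromkeys(students))

print(unique_students)

Output:
['Amin', 'Mei Ling', 'Ravi']

Note: list(set(students)) also removes duplicates, but the order may appear differently because sets are unordered.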
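
Real datasets usually store whole records rather than single names. As a rough sketch, duplicate records can be removed while keeping the original order by remembering which records have already been seen. The records and the choice of fields used as the key are assumptions for illustration.

```python
records = [
    {"name": "Amin", "marks": 76},
    {"name": "Mei Ling", "marks": 88},
    {"name": "Amin", "marks": 76},
    {"name": "Ravi", "marks": 90},
]

seen = set()
unique_records = []

for record in records:
    # Dictionaries cannot be stored in a set, so build a tuple key instead.
    key = (record["name"], record["marks"])
    if key not in seen:
        seen.add(key)
        unique_records.append(record)

for record in unique_records:
    print(record)
```

Output:
{'name': 'Amin', 'marks': 76}
{'name': 'Mei Ling', 'marks': 88}
{'name': 'Ravi', 'marks': 90}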

3.9 Data Transformation

Data transformation means changing data into a more useful format. Machine Learning models require data to be consistent, structured and mostly numeric.

Examples of Data Transformation

Raw Data	Transformation	Model-Ready Data
" pass "	strip() and title()	"Pass"
"85"	Convert to integer	85
"Male"	Encode category	1
90, 80, 70	Scale values (divide by 100)	0.9, 0.8, 0.7

Python Example: Convert Text Number to Integer

marks_text = "85"

marks_number = int(marks_text)

print(marks_number + 10)

Output:
95
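
In raw data, some text values cannot be converted, and int() raises an error for them. The sketch below shows one way to convert safely. The helper name to_int and the choice of None as the fallback value are assumptions for illustration.

```python
def to_int(text, default=None):
    """Return text as an integer, or default if it cannot be converted."""
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        # ValueError: the text is not a whole number, such as "N/A" or "".
        # AttributeError: the value is not text at all, such as None.
        return default


raw_marks = ["85", " 90 ", "N/A", "", None]

marks = [to_int(value) for value in raw_marks]

print(marks)
```

Output:
[85, 90, None, None, None]

The values that could not be converted become missing values, which can then be removed or filled.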

3.10 Encoding Categorical Data

Machine Learning models usually work with numbers. Therefore, text categories must often be converted into numeric form. This process is called encoding.

Example: Encoding Result Labels

Category	Encoded Value
Fail	0
Pass	1

result = "Pass"

if result == "Pass":
    encoded_result = 1
else:
    encoded_result = 0

print(encoded_result)

Output:
1

Example: Encoding Course Categories

course = "AI"

course_mapping = {
    "Python": 0,
    "AI": 1,
    "Data Science": 2
}

encoded_course = course_mapping[course]

print(encoded_course)

Output:
1
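
Mapping courses to 0, 1 and 2 can suggest an order between categories (for example, that Data Science is "greater than" Python) even though no such order exists. A common alternative is one-hot encoding, where each category gets its own 0 or 1 column. The following is a minimal sketch in plain Python; the helper name one_hot is an assumption for illustration.

```python
categories = ["Python", "AI", "Data Science"]


def one_hot(value, categories):
    """Return a list with 1 in the position of value and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


courses = ["AI", "Python", "Data Science"]

encoded_courses = [one_hot(course, categories) for course in courses]

for course, encoded in zip(courses, encoded_courses):
    print(course, encoded)
```

Output:
AI [0, 1, 0]
Python [1, 0, 0]
Data Science [0, 0, 1]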

3.11 Feature Scaling

Feature scaling means converting numeric values into a similar range. This helps some Machine Learning algorithms perform better because features with large ranges no longer dominate features with small ranges.

Example

If one feature is marks from 0 to 100 and another feature is income from 0 to 100000, the income value may dominate the model unless scaling is applied.

Simple Scaling Formula

scaled_value = value / maximum_value

Here maximum_value is the largest value the feature can take, for example 100 for marks.

Python Example: Scale Marks

marks = [50, 75, 100]

scaled_marks = []

for mark in marks:
    scaled_marks.append(mark / 100)

print(scaled_marks)

Output:
[0.5, 0.75, 1.0]
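
Dividing by the maximum works well when values start near 0. Another widely used option is min-max scaling, which uses scaled_value = (value - minimum) / (maximum - minimum) so that every feature ends up between 0 and 1. The sketch below applies it to marks and to a hypothetical income feature. The function name and the income values are assumptions for illustration.

```python
def min_max_scale(values):
    """Scale values to the range 0 to 1 using the minimum and maximum."""
    lowest = min(values)
    highest = max(values)
    if highest == lowest:
        # All values are equal, so avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


marks = [50, 75, 100]
incomes = [20000, 60000, 100000]

print(min_max_scale(marks))
print(min_max_scale(incomes))
```

Output:
[0.0, 0.5, 1.0]
[0.0, 0.5, 1.0]

After scaling, both features use the same range, so income no longer dominates marks.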

3.12 Feature Selection

Feature selection means choosing the most useful input variables for a Machine Learning model. Not every column in a dataset is useful for prediction.

Example: Student Pass Prediction

Feature	Useful?	Reason
Attendance	Yes	Strongly related to performance
Assignment Marks	Yes	Shows learning progress
Study Hours	Yes	May affect exam performance
Favourite Colour	No	Not relevant to academic result

Key Idea:
Good features help the model learn meaningful patterns. Irrelevant features may reduce model quality.
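
In code, feature selection often means keeping only the chosen columns. The sketch below builds the input features and the target label from student records. The records, the field names and the variable names features and labels are assumptions for illustration.

```python
# Hypothetical records with one irrelevant column (favourite_colour).
records = [
    {"attendance": 85, "assignment_marks": 76, "study_hours": 10,
     "favourite_colour": "Blue", "passed": 1},
    {"attendance": 60, "assignment_marks": 40, "study_hours": 3,
     "favourite_colour": "Red", "passed": 0},
]

# Keep only the features judged useful for the prediction.
selected_features = ["attendance", "assignment_marks", "study_hours"]

features = [[record[name] for name in selected_features] for record in records]
labels = [record["passed"] for record in records]

print(features)
print(labels)
```

Output:
[[85, 76, 10], [60, 40, 3]]
[1, 0]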

3.13 Train-Test Split

In Machine Learning, data is usually divided into training data and testing data. Training data is used to teach the model, while testing data is used to check how well the model performs.

1. Full Dataset

2. Training Data

3. Testing Data

4. Evaluate Model

Common Split

Data Portion	Purpose
80% Training Data	Used to train the model
20% Testing Data	Used to evaluate model performance

Simple Python Example

data = ["Record1", "Record2", "Record3", "Record4", "Record5"]

training_data = data[:4]
testing_data = data[4:]

print("Training:", training_data)
print("Testing:", testing_data)

Output:
Training: ['Record1', 'Record2', 'Record3', 'Record4']
Testing: ['Record5']
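
Slicing in the original order can give a misleading split if the records are sorted, for example by date or by result. A common precaution is to shuffle the records before splitting. The following is a minimal sketch using Python's built-in random module; the record names and the seed value 42 are arbitrary choices for illustration.

```python
import random

data = ["Record" + str(number) for number in range(1, 11)]

# A fixed seed makes the shuffle repeatable, so the split is the same each run.
rng = random.Random(42)

shuffled = data[:]        # copy, so the original list is not changed
rng.shuffle(shuffled)

# Use 80% of the records for training and the rest for testing.
split_index = int(len(shuffled) * 0.8)

training_data = shuffled[:split_index]
testing_data = shuffled[split_index:]

print("Training size:", len(training_data))
print("Testing size:", len(testing_data))
```

Output:
Training size: 8
Testing size: 2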

3.14 Practical Example: Complete Preprocessing Workflow

The following example demonstrates a beginner-friendly preprocessing workflow using plain Python. It cleans names, fills missing attendance with the average, caps marks above 100 at 100, encodes results and prepares structured records.

students = [
    {"name": " amin ", "attendance": 85, "marks": 76, "result": "pass"},
    {"name": "mei ling", "attendance": None, "marks": 88, "result": "PASS"},
    {"name": "ravi", "attendance": 90, "marks": 999, "result": "Pass"}
]

valid_attendance = [
    student["attendance"] for student in students
    if student["attendance"] is not None
]

average_attendance = sum(valid_attendance) / len(valid_attendance)

cleaned_students = []

for student in students:
    name = student["name"].strip().title()

    if student["attendance"] is None:
        attendance = average_attendance
    else:
        attendance = student["attendance"]

    if student["marks"] > 100:
        marks = 100
    else:
        marks = student["marks"]

    result = student["result"].strip().title()

    if result == "Pass":
        encoded_result = 1
    else:
        encoded_result = 0

    cleaned_students.append({
        "name": name,
        "attendance": attendance,
        "marks": marks,
        "result_encoded": encoded_result
    })

for record in cleaned_students:
    print(record)

Output:
{'name': 'Amin', 'attendance': 85, 'marks': 76, 'result_encoded': 1}
{'name': 'Mei Ling', 'attendance': 87.5, 'marks': 88, 'result_encoded': 1}
{'name': 'Ravi', 'attendance': 90, 'marks': 100, 'result_encoded': 1}

Learning Note:
This example converts messy raw data into structured model-ready records.

3.15 Preparing Data with Pandas

In professional ML workflows, Python libraries such as Pandas are used to handle datasets more efficiently. Pandas can read CSV files, clean columns, handle missing values and prepare data tables.

Example Pandas Workflow

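import pandas as pd

# Load the raw student records from a CSV file.
data = pd.read_csv("students.csv")

print(data.head())  # preview the raw data

# Standardize name formatting: trim spaces and use title case.
data["Name"] = data["Name"].str.strip().str.title()

# Remove exact duplicate rows first so repeated records do not skew the average.
data = data.drop_duplicates()

# Fill missing attendance with the column mean.
data["Attendance"] = data["Attendance"].fillna(data["Attendance"].mean())

print(data.head())  # preview the cleaned data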

Note:
Learners should first understand the logic using basic Python. Pandas makes the same work faster for larger datasets.

3.16 Common Beginner Mistakes

Mistake	Problem	Correction
Using raw data directly	Model may produce poor results.	Clean and preprocess data first.
Ignoring missing values	Training may fail or become inaccurate.	Remove or fill missing values properly.
Not encoding text categories	Model cannot process raw text labels.	Convert categories into numbers.
Keeping irrelevant features	Model may learn weak or misleading patterns.	Select meaningful features.
Not splitting data	Cannot properly test model performance.	Use training and testing data.

3.17 Hands-On Practice Activities

Activity 1: Identify Data Problems

Prepare a small table of student data with missing values, duplicate records and inconsistent labels. Identify all problems in the dataset.

Activity 2: Clean Text Data

Write a Python program that cleans student names using strip() and title().

Activity 3: Handle Missing Values

Create a list of attendance values with one missing value. Replace the missing value with the average.

Activity 4: Encode Labels

Create a program that converts Pass into 1 and Fail into 0.

Mini Project: Student Dataset Preprocessor

Create a Python program that accepts messy student records and outputs cleaned records with formatted names, filled attendance, corrected marks and encoded results.

3.18 Interactive Final Assessment Quiz

Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.

Instructions:
Select the correct answer for each question.

1. Why is preprocessing important in Machine Learning?

It makes raw data suitable for model training
It removes the need for data
It only changes website design
It deletes all records

2. Which problem occurs when important values are blank?

Duplicate data
Missing values
Feature scaling
Encoding

3. Which method removes extra spaces from text?

append()
strip()
sort()
split_data()

4. Encoding is used to:

Convert categories into numeric form
Make data disappear
Increase missing values
Remove Python

5. Feature scaling helps convert numeric values into similar ranges.

True False

6. Duplicate records can bias a dataset.

True False

7. Training data is used to teach a Machine Learning model.

True False

8. Testing data is used to evaluate model performance.

True False

9. Irrelevant features can reduce model quality.

True False

10. A quality dataset should be accurate, complete, consistent and relevant.

True False

3.19 Chapter Summary

In this chapter, learners studied data collection and preprocessing for Machine Learning. They learned how to identify data sources, understand data quality, clean missing and duplicate data, transform values, encode categories, scale features, select relevant inputs and prepare structured datasets.

Remember:
A Machine Learning model is only as good as the data used to train it.
Data preprocessing is a critical skill for developing structured, high-quality datasets.