Chapter 3: Data Collection & Preprocessing

Clean, transform and prepare datasets for Machine Learning. Develop structured, high-quality datasets that are reliable, consistent and model-ready for AI development.

Topics: Data Collection, Cleaning, Transformation, Preprocessing, Quality Dataset

Pipeline: Raw Data → Clean Data → Feature Ready → ML Dataset

3.1 Chapter Overview

Machine Learning models depend heavily on data quality. A model trained with poor, incomplete, inconsistent or biased data will usually produce poor predictions. This is why data collection and preprocessing are among the most important stages in any AI or ML project.

Data preprocessing means preparing raw data before it is used for model training. This may include removing duplicates, handling missing values, correcting formats, converting text categories into numbers, scaling values, selecting features and splitting data into training and testing sets.

Key Learning Focus: Good Machine Learning begins with good data. Clean, structured, high-quality datasets improve model accuracy, reliability and usefulness.

3.2 Learning Objectives

  • Understand the importance of data collection in Machine Learning projects.
  • Identify different sources of data for AI and ML development.
  • Recognize common data quality problems.
  • Clean missing, duplicate and inconsistent data.
  • Transform raw data into structured datasets.
  • Encode categorical data into machine-readable format.
  • Scale numeric features for better model performance.
  • Prepare structured, high-quality datasets for Machine Learning workflows.

Learning Outcome

By the end of this chapter, learners will be able to develop structured, high-quality datasets for Machine Learning by applying data collection, cleaning, transformation and preprocessing techniques.

3.3 What is Data Collection?

Data collection is the process of gathering information that will be used for analysis, prediction or model training. In Machine Learning, data is the foundation because the model learns patterns from examples.

Common Data Sources

Data Source | Example | Use in ML
CSV / Excel Files | Student marks, sales records, attendance sheets | Common source for beginner ML projects
Databases | Customer records, inventory systems | Used in business and enterprise applications
APIs | Weather, finance, social media data | Used for live or updated data
Sensors / IoT | Temperature, machine vibration, traffic data | Used in smart manufacturing and automation
Web Forms | Survey responses, registration forms | Used for collecting user information

Example: For a student performance prediction model, useful data may include attendance percentage, assignment marks, quiz scores, study hours and final exam results.
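
As a minimal sketch of collecting data from a CSV source, the example below reads a few student rows with Python's built-in csv module. The column names and values are hypothetical, and the text is held in memory so the example runs without a real file.

```python
import csv
import io

# Hypothetical CSV content; in a real project this would come from a file on disk.
csv_text = """name,attendance,assignment_marks,final_result
Amin,85,76,Pass
Mei Ling,,88,Pass
Ravi,90,64,Fail
"""

# csv.DictReader turns each row into a dictionary keyed by the header names.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for record in records:
    print(record)

# Every value arrives as text, and a blank cell arrives as an empty string.
print(records[1]["attendance"] == "")
```

Because every value is read as text, numeric columns still need to be converted and blank cells still need to be handled before training.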

3.4 Types of Data in ML Datasets

Before preprocessing data, it is important to understand what type of data is being collected. Different data types require different cleaning and transformation methods.

Data Type | Description | Example
Numeric Data | Numbers used for calculation. | Age, marks, price, attendance percentage
Categorical Data | Data grouped into categories. | Gender, course type, city, result
Text Data | Free-form written content. | Feedback, comments, review text
Date / Time Data | Time-based information. | Registration date, login time
Boolean Data | True or False values. | Payment completed, certificate issued
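
To illustrate how these data types look in code, the sketch below converts one raw record, in which every field is text, into typed Python values. The field names, the date format and the "true"/"false" spelling are assumptions made for this example.

```python
from datetime import datetime

# Hypothetical raw record: every field starts out as text.
raw_record = {
    "age": "21",
    "course_type": "AI",
    "feedback": "Clear explanations",
    "registration_date": "2024-03-15",
    "payment_completed": "true",
}

typed_record = {
    # Numeric data: convert to int so it can be used in calculations.
    "age": int(raw_record["age"]),
    # Categorical data: kept as a label for now, encoded later.
    "course_type": raw_record["course_type"],
    # Text data: free-form content, kept as a string.
    "feedback": raw_record["feedback"],
    # Date data: parsed using an assumed YYYY-MM-DD format.
    "registration_date": datetime.strptime(
        raw_record["registration_date"], "%Y-%m-%d"
    ).date(),
    # Boolean data: True only when the text is "true" (case-insensitive).
    "payment_completed": raw_record["payment_completed"].lower() == "true",
}

for field, value in typed_record.items():
    print(field, type(value).__name__, value)
```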

3.5 What is Data Quality?

Data quality describes how suitable, accurate and reliable the data is for analysis and model training. Poor data quality can cause incorrect predictions and misleading conclusions.

Important Data Quality Dimensions

Accuracy

Data should represent the correct value. Example: marks should be entered correctly.

Completeness

Important values should not be missing. Example: attendance should not be blank.

Consistency

Data should follow the same format. Example: course names should be written consistently.

Relevance

Data should be useful for the ML problem. Example: shoe size may not help predict exam results.

Common Data Problems

Problem | Example | Impact
Missing Values | Attendance is blank | Model may not learn correctly
Duplicate Records | Same student appears twice | Results may become biased
Incorrect Format | Marks stored as text | Calculation may fail
Inconsistent Labels | Pass, pass, PASSED | Categories become confusing
Outliers | Marks = 999 | Can distort analysis
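
The problems in the table above can be detected with a few simple checks. The following is a minimal sketch in plain Python; the sample records are hypothetical, and the valid range for marks (0 to 100) is an assumption.

```python
# Hypothetical records containing each problem from the table above.
records = [
    {"name": "Amin", "attendance": 85, "marks": 76, "result": "Pass"},
    {"name": "Mei Ling", "attendance": None, "marks": 88, "result": "PASSED"},
    {"name": "Ravi", "attendance": 90, "marks": 999, "result": "pass"},
    {"name": "Amin", "attendance": 85, "marks": 76, "result": "Pass"},
    {"name": "Siti", "attendance": 70, "marks": "64", "result": "Fail"},
]

# Missing values: attendance was never entered.
missing = [r["name"] for r in records if r["attendance"] is None]

# Duplicate records: the same combination of values appears more than once.
seen = set()
duplicates = []
for r in records:
    key = (r["name"], r["attendance"], r["marks"], r["result"])
    if key in seen:
        duplicates.append(r["name"])
    seen.add(key)

# Incorrect format: marks stored as text instead of a number.
wrong_format = [r["name"] for r in records if not isinstance(r["marks"], int)]

# Outliers: numeric marks outside the assumed valid range of 0 to 100.
outliers = [
    r["name"] for r in records
    if isinstance(r["marks"], int) and not 0 <= r["marks"] <= 100
]

# Inconsistent labels: list every distinct spelling that appears.
labels = sorted({r["result"] for r in records})

print("Missing attendance:", missing)
print("Duplicates:", duplicates)
print("Wrong format:", wrong_format)
print("Outliers:", outliers)
print("Distinct labels:", labels)
```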

3.6 Data Cleaning

Data cleaning is the process of fixing problems in raw data. It may include correcting spelling, removing duplicates, handling missing values, converting data types and standardizing formats.

Example Raw Student Data

Name | Attendance | Marks | Result
" amin " | 85 | 76 | pass
Mei Ling | (blank) | 88 | PASS
Ravi | 90 | 999 | Pass

This dataset has extra spaces, missing attendance, inconsistent result labels and an unrealistic mark value. Before using it for ML, these issues must be fixed.

Python Example: Cleaning Text Values

name = "  amin  "
result = "pass"

clean_name = name.strip().title()
clean_result = result.strip().title()

print(clean_name)
print(clean_result)
Output:
Amin
Pass
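
Note that strip() and title() alone do not merge different spellings: "PASSED" becomes "Passed", not "Pass". One possible approach, sketched below, maps every accepted spelling to a single label. The alias list is an assumption for this example and would need to match the spellings found in the real dataset.

```python
# Assumption: these are the only spellings that should count as each label.
LABEL_ALIASES = {
    "pass": "Pass",
    "passed": "Pass",
    "fail": "Fail",
    "failed": "Fail",
}

def normalize_result(raw_label):
    """Return the standard label, or None if the spelling is not recognized."""
    key = raw_label.strip().lower()
    return LABEL_ALIASES.get(key)

raw_labels = ["Pass", " pass ", "PASSED", "fail", "absent"]
clean_labels = [normalize_result(label) for label in raw_labels]

print(clean_labels)
# ['Pass', 'Pass', 'Pass', 'Fail', None]
```

Returning None for an unrecognized spelling keeps the problem visible, so it can be reviewed instead of being silently counted as a pass or a fail.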

3.7 Handling Missing Values

Missing values occur when data is not available or was not entered. They must be handled carefully because most Machine Learning algorithms cannot train on blank values.

Common Ways to Handle Missing Values

Method | Explanation | Example
Remove Records | Delete rows with missing values. | Remove student record with no marks
Fill with Mean | Use average value for missing numeric data. | Fill missing attendance with average attendance
Fill with Mode | Use most common value for missing category data. | Fill missing result with most common result
Use Placeholder | Use a fixed value when data is unknown. | Fill missing city with "Unknown"

Python Example: Fill Missing Attendance

attendance = [85, None, 90, 75]

valid_attendance = [value for value in attendance if value is not None]
average_attendance = sum(valid_attendance) / len(valid_attendance)

cleaned_attendance = []

for value in attendance:
    if value is None:
        cleaned_attendance.append(average_attendance)
    else:
        cleaned_attendance.append(value)

print(cleaned_attendance)
Output:
[85, 83.33333333333333, 90, 75]
Learning Note: Filling missing values is also called imputation.
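
The same idea applies to category data. The sketch below shows mode and placeholder imputation using the built-in statistics module; the sample values are hypothetical.

```python
from statistics import mode

results = ["Pass", "Fail", None, "Pass", "Pass"]
cities = ["Kuala Lumpur", None, "Penang"]

# Mode imputation: fill the gap with the most common known value.
known_results = [value for value in results if value is not None]
most_common_result = mode(known_results)
filled_results = [
    most_common_result if value is None else value for value in results
]

# Placeholder imputation: mark the gap with a fixed "Unknown" label.
filled_cities = ["Unknown" if value is None else value for value in cities]

print(filled_results)
print(filled_cities)
```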

3.8 Removing Duplicate Data

Duplicate records can create bias in Machine Learning. If the same record appears many times, the model may give that pattern too much importance.

Python Example: Remove Duplicate Student Names

students = ["Amin", "Mei Ling", "Amin", "Ravi", "Ravi"]

unique_students = list(set(students))

print(unique_students)
Output:
['Amin', 'Mei Ling', 'Ravi']

Note: The displayed order may differ from run to run because sets are unordered.
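
When the original order matters, one alternative is to use dictionary keys, which are unique and keep insertion order in Python 3.7 and later. The sketch below also removes duplicate records, not just duplicate names; the sample records are hypothetical.

```python
students = ["Amin", "Mei Ling", "Amin", "Ravi", "Ravi"]

# Dictionary keys are unique and keep the order in which they were first seen.
unique_students = list(dict.fromkeys(students))
print(unique_students)
# ['Amin', 'Mei Ling', 'Ravi']

# For full records, compare all fields rather than the name alone.
records = [
    {"name": "Amin", "marks": 76},
    {"name": "Mei Ling", "marks": 88},
    {"name": "Amin", "marks": 76},
]

seen = set()
unique_records = []
for record in records:
    key = (record["name"], record["marks"])
    if key not in seen:
        seen.add(key)
        unique_records.append(record)

print(unique_records)
```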

3.9 Data Transformation

Data transformation means changing data into a more useful format. Machine Learning models require data to be consistent, structured and mostly numeric.

Examples of Data Transformation

Raw Data | Transformation | Model-Ready Data
" pass " | strip() and title() | "Pass"
"85" | Convert to integer | 85
"Male" | Encode category | 1
90, 80, 70 | Scale values | 0.9, 0.8, 0.7

Python Example: Convert Text Number to Integer

marks_text = "85"

marks_number = int(marks_text)

print(marks_number + 10)
Output:
95
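
In real datasets some text values cannot be converted, and int() raises a ValueError for them. The following sketch shows one way to handle this; the helper name to_int_or_none is hypothetical, and returning None for bad values is a design choice for this example.

```python
def to_int_or_none(text):
    """Convert text to an integer, or return None if it is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None

raw_marks = ["85", " 90 ", "eighty", ""]
converted_marks = [to_int_or_none(value) for value in raw_marks]

print(converted_marks)
# [85, 90, None, None]
```

The None values can then be treated like any other missing value, for example by removing the record or filling it with the average.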

3.10 Encoding Categorical Data

Machine Learning models usually work with numbers. Therefore, text categories must often be converted into numeric form. This process is called encoding.

Example: Encoding Result Labels

Category | Encoded Value
Fail | 0
Pass | 1

result = "Pass"

if result == "Pass":
    encoded_result = 1
else:
    encoded_result = 0

print(encoded_result)
Output:
1

Example: Encoding Course Categories

course = "AI"

course_mapping = {
    "Python": 0,
    "AI": 1,
    "Data Science": 2
}

encoded_course = course_mapping[course]

print(encoded_course)
Output:
1
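
Numbering courses as 0, 1 and 2 can suggest an order that does not really exist between them. A common alternative for such categories is one-hot encoding, where each category gets its own 0/1 flag. The sketch below is a plain Python illustration; the helper name one_hot and the use of -1 for unknown courses are assumptions made for this example.

```python
courses = ["Python", "AI", "Data Science"]

def one_hot(course):
    """Return one 0/1 flag per known course; all zeros for an unknown course."""
    return [1 if course == known else 0 for known in courses]

print(one_hot("AI"))        # [0, 1, 0]
print(one_hot("Robotics"))  # [0, 0, 0]

# With a mapping dictionary, .get() avoids a KeyError for unknown categories.
course_mapping = {"Python": 0, "AI": 1, "Data Science": 2}
encoded_unknown = course_mapping.get("Robotics", -1)  # -1 marks "unknown" here
print(encoded_unknown)
```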

3.11 Feature Scaling

Feature scaling means converting numeric values into a similar range. This helps algorithms that rely on distances or gradient-based training, because features with large values no longer dominate features with small values.

Example

If one feature is marks from 0 to 100 and another feature is income from 0 to 100000, the income value may dominate the model unless scaling is applied.

Simple Scaling Formula

scaled_value = value / maximum_value

Python Example: Scale Marks

marks = [50, 75, 100]

scaled_marks = []

for mark in marks:
    scaled_marks.append(mark / 100)

print(scaled_marks)
Output:
[0.5, 0.75, 1.0]
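
The marks and income example can be sketched the same way. Here each feature is divided by its own largest observed value, which is a small variation on dividing by the known maximum of 100. The sample records and the helper name scale_by_max are hypothetical.

```python
# Hypothetical records with two features on very different ranges.
records = [
    {"marks": 50, "income": 20000},
    {"marks": 75, "income": 100000},
    {"marks": 100, "income": 60000},
]

def scale_by_max(values):
    """Divide each value by the largest value (assumes a positive maximum)."""
    maximum = max(values)
    return [value / maximum for value in values]

scaled_marks = scale_by_max([r["marks"] for r in records])
scaled_income = scale_by_max([r["income"] for r in records])

print(scaled_marks)   # [0.5, 0.75, 1.0]
print(scaled_income)  # [0.2, 1.0, 0.6]
```

After scaling, both features lie between 0 and 1, so neither one dominates simply because of its units.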

3.12 Feature Selection

Feature selection means choosing the most useful input variables for a Machine Learning model. Not every column in a dataset is useful for prediction.

Example: Student Pass Prediction

Feature | Useful? | Reason
Attendance | Yes | Strongly related to performance
Assignment Marks | Yes | Shows learning progress
Study Hours | Yes | May affect exam performance
Favourite Colour | No | Not relevant to academic result

Key Idea: Good features help the model learn meaningful patterns. Irrelevant features may reduce model quality.
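
In code, feature selection can be as simple as keeping only the chosen columns. The sketch below separates the selected input features from the label to be predicted; the records and field names are hypothetical.

```python
# Hypothetical student records, including one irrelevant column.
records = [
    {"attendance": 85, "assignment_marks": 76, "study_hours": 10,
     "favourite_colour": "Blue", "result": 1},
    {"attendance": 60, "assignment_marks": 45, "study_hours": 3,
     "favourite_colour": "Red", "result": 0},
]

# Keep only the columns judged useful; favourite_colour is left out.
selected_features = ["attendance", "assignment_marks", "study_hours"]

# Inputs: one list of feature values per student.
features = [[record[name] for name in selected_features] for record in records]

# Output: the value the model should learn to predict.
labels = [record["result"] for record in records]

print(features)
print(labels)
```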

3.13 Train-Test Split

In Machine Learning, data is usually divided into training data and testing data. Training data is used to teach the model, while testing data is used to check how well the model performs.

1. Full Dataset
2. Training Data
3. Testing Data
4. Evaluate Model

Common Split

Data Portion | Purpose
80% Training Data | Used to train the model
20% Testing Data | Used to evaluate model performance

Simple Python Example

data = ["Record1", "Record2", "Record3", "Record4", "Record5"]

training_data = data[:4]
testing_data = data[4:]

print("Training:", training_data)
print("Testing:", testing_data)
Output:
Training: ['Record1', 'Record2', 'Record3', 'Record4']
Testing: ['Record5']
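
Slicing in the original order can give a misleading test set if the records are sorted, for example by date or by result. A common precaution is to shuffle before splitting. The following is a minimal sketch using the built-in random module; the seed value 42 and the ten sample records are arbitrary choices for this example.

```python
import random

data = [f"Record{i}" for i in range(1, 11)]

# Shuffle a copy so the original list stays unchanged.
# A fixed seed makes the shuffle repeatable between runs.
shuffled = data[:]
random.Random(42).shuffle(shuffled)

# 80% of the records go to training, the rest to testing.
split_index = int(len(shuffled) * 0.8)
training_data = shuffled[:split_index]
testing_data = shuffled[split_index:]

print("Training:", training_data)
print("Testing:", testing_data)
```

Each record ends up in exactly one of the two lists, so the model is never tested on data it was trained on.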

3.14 Practical Example: Complete Preprocessing Workflow

The following example demonstrates a beginner-friendly preprocessing workflow using plain Python. It cleans names, fills missing attendance with the average, caps marks above 100 at 100, encodes results and prepares structured records.

students = [
    {"name": " amin ", "attendance": 85, "marks": 76, "result": "pass"},
    {"name": "mei ling", "attendance": None, "marks": 88, "result": "PASS"},
    {"name": "ravi", "attendance": 90, "marks": 999, "result": "Pass"}
]

valid_attendance = [
    student["attendance"] for student in students
    if student["attendance"] is not None
]

average_attendance = sum(valid_attendance) / len(valid_attendance)

cleaned_students = []

for student in students:
    name = student["name"].strip().title()

    if student["attendance"] is None:
        attendance = average_attendance
    else:
        attendance = student["attendance"]

    if student["marks"] > 100:
        marks = 100
    else:
        marks = student["marks"]

    result = student["result"].strip().title()

    if result == "Pass":
        encoded_result = 1
    else:
        encoded_result = 0

    cleaned_students.append({
        "name": name,
        "attendance": attendance,
        "marks": marks,
        "result_encoded": encoded_result
    })

for record in cleaned_students:
    print(record)
Output:
{'name': 'Amin', 'attendance': 85, 'marks': 76, 'result_encoded': 1}
{'name': 'Mei Ling', 'attendance': 87.5, 'marks': 88, 'result_encoded': 1}
{'name': 'Ravi', 'attendance': 90, 'marks': 100, 'result_encoded': 1}
Learning Note: This example converts messy raw data into structured model-ready records.

3.15 Preparing Data with Pandas

In professional ML workflows, Python libraries such as Pandas are used to handle datasets more efficiently. Pandas can read CSV files, clean columns, handle missing values and prepare data tables.

Example Pandas Workflow

import pandas as pd

data = pd.read_csv("students.csv")

print(data.head())

data["Name"] = data["Name"].str.strip().str.title()

data["Attendance"] = data["Attendance"].fillna(data["Attendance"].mean())

data = data.drop_duplicates()

print(data.head())
Note: Learners should first understand the logic using basic Python. Pandas makes the same work faster for larger datasets.

3.16 Common Beginner Mistakes

Mistake | Problem | Correction
Using raw data directly | Model may produce poor results. | Clean and preprocess data first.
Ignoring missing values | Training may fail or become inaccurate. | Remove or fill missing values properly.
Not encoding text categories | Model cannot process raw text labels. | Convert categories into numbers.
Keeping irrelevant features | Model may learn weak or misleading patterns. | Select meaningful features.
Not splitting data | Cannot properly test model performance. | Use training and testing data.

3.17 Hands-On Practice Activities

Activity 1: Identify Data Problems

Prepare a small table of student data with missing values, duplicate records and inconsistent labels. Identify all problems in the dataset.

Activity 2: Clean Text Data

Write a Python program that cleans student names using strip() and title().

Activity 3: Handle Missing Values

Create a list of attendance values with one missing value. Replace the missing value with the average.

Activity 4: Encode Labels

Create a program that converts Pass into 1 and Fail into 0.

Mini Project: Student Dataset Preprocessor

Create a Python program that accepts messy student records and outputs cleaned records with formatted names, filled attendance, corrected marks and encoded results.

3.18 Interactive Final Assessment Quiz

Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.

Instructions: Select the correct answer for each question.

1. Why is preprocessing important in Machine Learning?

2. Which problem occurs when important values are blank?

3. Which method removes extra spaces from text?

4. Encoding is used to:

5. Feature scaling helps convert numeric values into similar ranges.

6. Duplicate records can bias a dataset.

7. Training data is used to teach a Machine Learning model.

8. Testing data is used to evaluate model performance.

9. Irrelevant features can reduce model quality.

10. A quality dataset should be accurate, complete, consistent and relevant.


3.19 Chapter Summary

In this chapter, learners studied data collection and preprocessing for Machine Learning. They learned how to identify data sources, understand data quality, clean missing and duplicate data, transform values, encode categories, scale features, select relevant inputs and prepare structured datasets.

Remember: A Machine Learning model is only as good as the data used to train it. Data preprocessing is a critical skill for developing structured, high-quality datasets.