Chapter 3: Data Collection & Preprocessing
Clean, transform and prepare datasets for Machine Learning. Develop structured and quality datasets that are reliable, consistent, model-ready and useful for AI development.
3.1 Chapter Overview
Machine Learning models depend heavily on data quality. A model trained with poor, incomplete, inconsistent or biased data will usually produce poor predictions. This is why data collection and preprocessing are among the most important stages in any AI or ML project.
Data preprocessing means preparing raw data before it is used for model training. This may include removing duplicates, handling missing values, correcting formats, converting text categories into numbers, scaling values, selecting features and splitting data into training and testing sets.
3.2 Learning Objectives
- Understand the importance of data collection in Machine Learning projects.
- Identify different sources of data for AI and ML development.
- Recognize common data quality problems.
- Clean missing, duplicate and inconsistent data.
- Transform raw data into structured datasets.
- Encode categorical data into machine-readable format.
- Scale numeric features for better model performance.
- Prepare structured and quality datasets for Machine Learning workflows.
Learning Outcome
By the end of this chapter, learners will be able to develop structured and quality datasets for Machine Learning by applying data collection, cleaning, transformation and preprocessing techniques.
3.3 What is Data Collection?
Data collection is the process of gathering information that will be used for analysis, prediction or model training. In Machine Learning, data is the foundation because the model learns patterns from examples.
Common Data Sources
| Data Source | Example | Use in ML |
|---|---|---|
| CSV / Excel Files | Student marks, sales records, attendance sheets | Common source for beginner ML projects |
| Databases | Customer records, inventory systems | Used in business and enterprise applications |
| APIs | Weather, finance, social media data | Used for live or updated data |
| Sensors / IoT | Temperature, machine vibration, traffic data | Used in smart manufacturing and automation |
| Web Forms | Survey responses, registration forms | Used for collecting user information |
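As a minimal sketch of the first row in the table, the snippet below reads CSV data with Python's built-in `csv` module. The CSV text and its column names are hypothetical and are held in memory so the example runs without a file on disk.

```python
import csv
import io

# Hypothetical CSV content; in a real project this would be a file on disk.
csv_text = """name,attendance,marks
Amin,85,76
Mei Ling,,88
Ravi,90,92
"""

# DictReader uses the header row as keys and yields one dict per data row.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for record in records:
    print(record)
```

Note that every value arrives as text and a blank cell arrives as an empty string, which is why the cleaning and conversion steps later in this chapter are needed.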
3.4 Types of Data in ML Datasets
Before preprocessing data, it is important to understand what type of data is being collected. Different data types require different cleaning and transformation methods.
| Data Type | Description | Example |
|---|---|---|
| Numeric Data | Numbers used for calculation. | Age, marks, price, attendance percentage |
| Categorical Data | Data grouped into categories. | Gender, course type, city, result |
| Text Data | Free-form written content. | Feedback, comments, review text |
| Date / Time Data | Time-based information. | Registration date, login time |
| Boolean Data | True or False values. | Payment completed, certificate issued |
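The data types in the table map roughly onto Python's built-in types. The record below is a hypothetical illustration; the field names and values are assumptions, not part of any real dataset.

```python
from datetime import date

# One hypothetical student record with one field per data type.
student = {
    "age": 21,                               # numeric
    "course_type": "Part-time",              # categorical
    "feedback": "Clear and useful lessons",  # text
    "registration_date": date(2024, 1, 15),  # date / time
    "payment_completed": True,               # boolean
}

# Print the Python type used to store each field.
for field, value in student.items():
    print(field, "->", type(value).__name__)
```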
3.5 What is Data Quality?
Data quality means how suitable, accurate and reliable the data is for analysis and model training. Poor data quality can cause incorrect predictions and misleading conclusions.
Important Data Quality Dimensions
Accuracy
Data should represent the correct value. Example: marks should be entered correctly.
Completeness
Important values should not be missing. Example: attendance should not be blank.
Consistency
Data should follow the same format. Example: course names should be written consistently.
Relevance
Data should be useful for the ML problem. Example: shoe size may not help predict exam results.
Common Data Problems
| Problem | Example | Impact |
|---|---|---|
| Missing Values | Attendance is blank | Model may not learn correctly |
| Duplicate Records | Same student appears twice | Results may become biased |
| Incorrect Format | Marks stored as text | Calculation may fail |
| Inconsistent Labels | Pass, pass, PASSED | Categories become confusing |
| Outliers | Marks = 999 | Can distort analysis |
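The problems in the table above can often be detected with a few simple checks before any cleaning starts. The sketch below uses hypothetical records and assumes that valid marks lie between 0 and 100.

```python
from collections import Counter

# Hypothetical records containing the problems listed in the table.
records = [
    {"name": "Amin", "attendance": 85, "marks": 76, "result": "Pass"},
    {"name": "Mei Ling", "attendance": None, "marks": 88, "result": "pass"},
    {"name": "Ravi", "attendance": 90, "marks": 999, "result": "PASSED"},
    {"name": "Amin", "attendance": 85, "marks": 76, "result": "Pass"},
]

# Missing values: count records where attendance is blank.
missing_attendance = sum(1 for r in records if r["attendance"] is None)

# Duplicate records: names that appear more than once.
name_counts = Counter(r["name"] for r in records)
duplicate_names = [name for name, count in name_counts.items() if count > 1]

# Inconsistent labels: distinct spellings found in the result column.
result_labels = sorted({r["result"] for r in records})

# Outliers: marks outside the assumed valid range of 0 to 100.
outlier_marks = [r["marks"] for r in records if not 0 <= r["marks"] <= 100]

print("Missing attendance:", missing_attendance)
print("Duplicate names:", duplicate_names)
print("Result labels:", result_labels)
print("Outlier marks:", outlier_marks)
```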
3.6 Data Cleaning
Data cleaning is the process of fixing problems in raw data. It may include correcting spelling, removing duplicates, handling missing values, converting data types and standardizing formats.
Example Raw Student Data
| Name | Attendance | Marks | Result |
|---|---|---|---|
| amin | 85 | 76 | pass |
| Mei Ling | | 88 | PASS |
| Ravi | 90 | 999 | Pass |
This dataset has extra spaces, missing attendance, inconsistent result labels and an unrealistic mark value. Before using it for ML, these issues must be fixed.
Python Example: Cleaning Text Values
name = " amin "
result = "pass"
clean_name = name.strip().title()
clean_result = result.strip().title()
print(clean_name)
print(clean_result)
Amin
Pass
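Using `strip()` and `title()` alone does not merge labels that are spelled differently, such as "PASSED" and "Pass". One possible approach, sketched below, is to look up each cleaned value in a mapping of known variants. The mapping and the "Unknown" fallback are assumptions for illustration.

```python
# Hypothetical mapping from lowercased variants to one canonical label.
canonical_labels = {
    "pass": "Pass",
    "passed": "Pass",
    "fail": "Fail",
    "failed": "Fail",
}

raw_results = ["Pass", " pass ", "PASSED", "fail"]

# Strip spaces, lowercase, then look up the canonical label.
clean_results = [
    canonical_labels.get(value.strip().lower(), "Unknown")
    for value in raw_results
]

print(clean_results)
```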
3.7 Handling Missing Values
Missing values occur when data is not available or not entered. Missing values must be handled carefully because Machine Learning models usually cannot train properly with blank values.
Common Ways to Handle Missing Values
| Method | Explanation | Example |
|---|---|---|
| Remove Records | Delete rows with missing values. | Remove student record with no marks |
| Fill with Mean | Use average value for missing numeric data. | Fill missing attendance with average attendance |
| Fill with Mode | Use most common value for missing category data. | Fill missing result with most common result |
| Use Placeholder | Use a fixed value when data is unknown. | Fill missing city with "Unknown" |
Python Example: Fill Missing Attendance
attendance = [85, None, 90, 75]
valid_attendance = [value for value in attendance if value is not None]
average_attendance = sum(valid_attendance) / len(valid_attendance)
cleaned_attendance = []
for value in attendance:
    if value is None:
        cleaned_attendance.append(average_attendance)
    else:
        cleaned_attendance.append(value)
print(cleaned_attendance)
[85, 83.33333333333333, 90, 75]
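The "Fill with Mode" and "Use Placeholder" methods from the table can be sketched in a similar way. The result and city values below are hypothetical.

```python
from collections import Counter

results = ["Pass", None, "Fail", "Pass"]
cities = ["Kuala Lumpur", None, "Penang"]

# Fill with mode: find the most common known result.
known_results = [value for value in results if value is not None]
most_common_result = Counter(known_results).most_common(1)[0][0]
filled_results = [most_common_result if value is None else value for value in results]

# Use placeholder: replace an unknown city with a fixed label.
filled_cities = ["Unknown" if value is None else value for value in cities]

print(filled_results)
print(filled_cities)
```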
3.8 Removing Duplicate Data
Duplicate records can create bias in Machine Learning. If the same record appears many times, the model may give that pattern too much importance.
Python Example: Remove Duplicate Student Names
students = ["Amin", "Mei Ling", "Amin", "Ravi", "Ravi"]
unique_students = list(set(students))
print(unique_students)
['Amin', 'Mei Ling', 'Ravi']
Note: The order may appear differently because sets are unordered.
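If the original order matters, one alternative is `dict.fromkeys()`, which keeps the first occurrence of each value in insertion order. This is a sketch of that alternative, not a replacement for the set-based example.

```python
students = ["Amin", "Mei Ling", "Amin", "Ravi", "Ravi"]

# Dictionary keys are unique and remember insertion order,
# so converting back to a list keeps the first occurrence of each name.
unique_students = list(dict.fromkeys(students))

print(unique_students)
```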
3.9 Data Transformation
Data transformation means changing data into a more useful format. Machine Learning models require data to be consistent, structured and mostly numeric.
Examples of Data Transformation
| Raw Data | Transformation | Model-Ready Data |
|---|---|---|
| " pass " | strip() and title() | "Pass" |
| "85" | Convert to integer | 85 |
| "Male" | Encode category | 1 |
| 90, 80, 70 | Scale values | 0.9, 0.8, 0.7 |
Python Example: Convert Text Number to Integer
marks_text = "85"
marks_number = int(marks_text)
print(marks_number + 10)
95
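Real data may contain text that cannot be converted, and `int()` raises a `ValueError` in that case. The helper below is a hypothetical sketch that returns `None` for such values so they can be handled as missing data.

```python
def to_int_or_none(text):
    """Return text as an integer, or None if it is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None

raw_marks = ["85", " 76 ", "PASS", ""]
converted_marks = [to_int_or_none(value) for value in raw_marks]

print(converted_marks)
```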
3.10 Encoding Categorical Data
Machine Learning models usually work with numbers. Therefore, text categories must often be converted into numeric form. This process is called encoding.
Example: Encoding Result Labels
| Category | Encoded Value |
|---|---|
| Fail | 0 |
| Pass | 1 |
result = "Pass"
if result == "Pass":
    encoded_result = 1
else:
    encoded_result = 0
print(encoded_result)
1
Example: Encoding Course Categories
course = "AI"
course_mapping = {
    "Python": 0,
    "AI": 1,
    "Data Science": 2
}
encoded_course = course_mapping[course]
print(encoded_course)
1
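Mapping courses to 0, 1 and 2 can suggest an order between categories that does not really exist. A common alternative, which this chapter does not otherwise cover, is one-hot encoding, where each category gets its own 0 or 1 position. The sketch below reuses the same three course names, and the helper name is hypothetical.

```python
courses = ["Python", "AI", "Data Science"]

def one_hot(course):
    """Return a list with 1 at the position of the course and 0 elsewhere."""
    return [1 if course == known else 0 for known in courses]

print(one_hot("AI"))
print(one_hot("Data Science"))
```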
3.11 Feature Scaling
Feature scaling means converting numeric values into a similar range. This helps some Machine Learning algorithms perform better because it prevents features with large values from dominating features with small values.
Example
If one feature is marks from 0 to 100 and another feature is income from 0 to 100000, the income value may dominate the model unless scaling is applied.
Simple Scaling Formula
scaled_value = value / maximum_value
Python Example: Scale Marks
marks = [50, 75, 100]
scaled_marks = []
for mark in marks:
    scaled_marks.append(mark / 100)
print(scaled_marks)
[0.5, 0.75, 1.0]
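Dividing by the maximum works well when values start near zero. A different and widely used technique is min-max scaling, which maps the smallest value to 0 and the largest to 1 using (value - minimum) / (maximum - minimum). The sketch below applies it to the same marks, and the function name is an assumption.

```python
def min_max_scale(values):
    """Scale values so the smallest becomes 0.0 and the largest becomes 1.0."""
    lowest = min(values)
    highest = max(values)
    if highest == lowest:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]

marks = [50, 75, 100]
print(min_max_scale(marks))
```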
3.12 Feature Selection
Feature selection means choosing the most useful input variables for a Machine Learning model. Not every column in a dataset is useful for prediction.
Example: Student Pass Prediction
| Feature | Useful? | Reason |
|---|---|---|
| Attendance | Yes | Strongly related to performance |
| Assignment Marks | Yes | Shows learning progress |
| Study Hours | Yes | May affect exam performance |
| Favourite Colour | No | Not relevant to academic result |
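In code, feature selection can be as simple as keeping only the chosen columns from each record. The records and field names below are hypothetical and mirror the table above.

```python
# Features judged useful in the table above.
selected_features = ["attendance", "assignment_marks", "study_hours"]

students = [
    {"attendance": 85, "assignment_marks": 76, "study_hours": 10, "favourite_colour": "Blue"},
    {"attendance": 90, "assignment_marks": 88, "study_hours": 14, "favourite_colour": "Red"},
]

# Keep only the selected features in each record.
selected_students = [
    {feature: student[feature] for feature in selected_features}
    for student in students
]

for record in selected_students:
    print(record)
```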
3.13 Train-Test Split
In Machine Learning, data is usually divided into training data and testing data. Training data is used to teach the model, while testing data is used to check how well the model performs.
Common Split
| Data Portion | Purpose |
|---|---|
| 80% Training Data | Used to train the model |
| 20% Testing Data | Used to evaluate model performance |
Simple Python Example
data = ["Record1", "Record2", "Record3", "Record4", "Record5"]
training_data = data[:4]
testing_data = data[4:]
print("Training:", training_data)
print("Testing:", testing_data)
Training: ['Record1', 'Record2', 'Record3', 'Record4']
Testing: ['Record5']
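Slicing keeps records in their original order, which can bias the split if the data is sorted. A minimal sketch of a shuffled 80/20 split follows. The record names are hypothetical and the seed value is an arbitrary choice that makes the shuffle repeatable.

```python
import random

data = [f"Record{number}" for number in range(1, 11)]

# Shuffle a copy so the original list is left unchanged.
shuffler = random.Random(42)  # fixed seed (arbitrary) for a repeatable shuffle
shuffled = data[:]
shuffler.shuffle(shuffled)

# Use the first 80% for training and the remaining 20% for testing.
split_index = int(len(shuffled) * 0.8)
training_data = shuffled[:split_index]
testing_data = shuffled[split_index:]

print("Training:", training_data)
print("Testing:", testing_data)
```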
3.14 Practical Example: Complete Preprocessing Workflow
The following example demonstrates a beginner-friendly preprocessing workflow using plain Python. It cleans names, handles missing attendance, validates marks, encodes results and prepares structured records.
students = [
    {"name": " amin ", "attendance": 85, "marks": 76, "result": "pass"},
    {"name": "mei ling", "attendance": None, "marks": 88, "result": "PASS"},
    {"name": "ravi", "attendance": 90, "marks": 999, "result": "Pass"}
]

# Average of the attendance values that are present.
valid_attendance = [
    student["attendance"] for student in students
    if student["attendance"] is not None
]
average_attendance = sum(valid_attendance) / len(valid_attendance)

cleaned_students = []
for student in students:
    # Remove extra spaces and capitalise each word of the name.
    name = student["name"].strip().title()

    # Fill missing attendance with the average.
    if student["attendance"] is None:
        attendance = average_attendance
    else:
        attendance = student["attendance"]

    # Cap marks above 100 at the maximum of 100.
    if student["marks"] > 100:
        marks = 100
    else:
        marks = student["marks"]

    # Standardise the result label, then encode Pass as 1 and anything else as 0.
    result = student["result"].strip().title()
    if result == "Pass":
        encoded_result = 1
    else:
        encoded_result = 0

    cleaned_students.append({
        "name": name,
        "attendance": attendance,
        "marks": marks,
        "result_encoded": encoded_result
    })

for record in cleaned_students:
    print(record)
{'name': 'Amin', 'attendance': 85, 'marks': 76, 'result_encoded': 1}
{'name': 'Mei Ling', 'attendance': 87.5, 'marks': 88, 'result_encoded': 1}
{'name': 'Ravi', 'attendance': 90, 'marks': 100, 'result_encoded': 1}
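Capping 999 at 100 keeps the record but gives Ravi a full score that was never recorded. Another option is to treat out-of-range marks as missing so they can be reviewed or filled later. The helper below is a hypothetical sketch that assumes valid marks lie between 0 and 100.

```python
def validate_marks(marks, low=0, high=100):
    """Return marks if it lies within [low, high], otherwise None."""
    if marks is None or not low <= marks <= high:
        return None
    return marks

raw_marks = [76, 88, 999, -5]
validated_marks = [validate_marks(value) for value in raw_marks]

print(validated_marks)
```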
3.15 Preparing Data with Pandas
In professional ML workflows, Python libraries such as Pandas are used to handle datasets more efficiently. Pandas can read CSV files, clean columns, handle missing values and prepare data tables.
Example Pandas Workflow
import pandas as pd
data = pd.read_csv("students.csv")
print(data.head())
data["Name"] = data["Name"].str.strip().str.title()
data["Attendance"] = data["Attendance"].fillna(data["Attendance"].mean())
data = data.drop_duplicates()
print(data.head())
3.16 Common Beginner Mistakes
| Mistake | Problem | Correction |
|---|---|---|
| Using raw data directly | Model may produce poor results. | Clean and preprocess data first. |
| Ignoring missing values | Training may fail or become inaccurate. | Remove or fill missing values properly. |
| Not encoding text categories | Model cannot process raw text labels. | Convert categories into numbers. |
| Keeping irrelevant features | Model may learn weak or misleading patterns. | Select meaningful features. |
| Not splitting data | Cannot properly test model performance. | Use training and testing data. |
3.17 Hands-On Practice Activities
Activity 1: Identify Data Problems
Prepare a small table of student data with missing values, duplicate records and inconsistent labels. Identify all problems in the dataset.
Activity 2: Clean Text Data
Write a Python program that cleans student names using strip() and title().
Activity 3: Handle Missing Values
Create a list of attendance values with one missing value. Replace the missing value with the average.
Activity 4: Encode Labels
Create a program that converts Pass into 1 and Fail into 0.
Mini Project: Student Dataset Preprocessor
Create a Python program that accepts messy student records and outputs cleaned records with formatted names, filled attendance, corrected marks and encoded results.
3.18 Interactive Final Assessment Quiz
Each correct answer gives +1 mark.
Each wrong answer gives -0.5 mark.
1. Why is preprocessing important in Machine Learning?
2. Which problem occurs when important values are blank?
3. Which method removes extra spaces from text?
4. Encoding is used to:
5. Feature scaling helps convert numeric values into similar ranges.
6. Duplicate records can bias a dataset.
7. Training data is used to teach a Machine Learning model.
8. Testing data is used to evaluate model performance.
9. Irrelevant features can reduce model quality.
10. A quality dataset should be accurate, complete, consistent and relevant.
3.19 Chapter Summary
In this chapter, learners studied data collection and preprocessing for Machine Learning. They learned how to identify data sources, understand data quality, clean missing and duplicate data, transform values, encode categories, scale features, select relevant inputs and prepare structured datasets.