Thursday, August 7, 2025

Data cleaning in data preprocessing

Data preprocessing in machine learning and data science: different methods of data cleaning


Data Cleaning

Mandatory when data contains noise, errors, or missing values.

Data cleaning is a critical step in the data preprocessing phase of any data analysis, machine learning, or data science project. It involves detecting and correcting (or removing) errors and inconsistencies in the data to improve its quality.

Purpose of Data Cleaning

  • Ensure accuracy and reliability of data
  • Improve model performance
  • Reduce noise and bias
  • Avoid misleading analysis or decisions

The following are the key tasks in data cleaning:

  • Handling missing values:
    • Imputation (mean/median/mode, KNN, regression): replacing missing data in a dataset with estimated values so that machine learning models can be trained without errors. The table below summarizes the common imputation types; a short KNN sketch follows it.

| Type | Description | Suitable For |
| --- | --- | --- |
| Mean Imputation | Replace the missing value with the mean of the column | Numerical data |
| Median Imputation | Replace the missing value with the median | Numerical data (with outliers) |
| Mode Imputation | Replace with the most frequent value | Categorical data |
| Constant Imputation | Replace with a specific constant (e.g., "Unknown", 0) | Both types |
| Forward/Backward Fill | Use the previous or next value (time series) | Time series data |
| KNN Imputation | Use nearest neighbors to estimate the missing value | Mixed-type features |
| Model-Based Imputation | Use ML models (e.g., regression) to predict missing values | Complex datasets |
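Of the advanced entries above, KNN imputation is usually the first one reached for. A minimal sketch using scikit-learn's KNNImputer, assuming two illustrative numeric columns ('Age' and 'Income'):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Income': [50000, 62000, np.nan, 48000, 55000]
})

# Each missing value is estimated from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Income']] = imputer.fit_transform(df[['Age', 'Income']])
print(df)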

  • Deletion (row or column)
    • Deleting rows from the dataset. When to use row deletion?
      • The number of missing rows is small, and deleting them won’t hurt the dataset.
      • The missing values are completely random.
      • You don’t want to introduce assumptions via imputation.
    • Deleting columns (features) from the dataset. When to use column deletion? (See the sketch after this list.)
      • A column has too many missing values (e.g., >50% missing).
      • The column is not informative, or it is redundant.
      • You plan to drop high-cardinality columns that are hard to encode.
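Both deletion strategies in a minimal sketch, assuming a generic DataFrame df and using the >50% rule of thumb from above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, np.nan, np.nan, 1],  # 75% missing
    'C': ['x', 'y', 'z', 'w']
})

# Row deletion: drop any row that still contains a missing value
rows_cleaned = df.dropna()

# Column deletion: keep only columns where at most 50% of values are missing
cols_cleaned = df.loc[:, df.isna().mean() <= 0.5]
print(cols_cleaned)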

  • Removing duplicates
    • Removing duplicate rows that repeat the same information in the dataset. Why is it important? (An inspection sketch follows this list.)
      • Prevents the model from being biased by repeated records.
      • Reduces dataset size and improves model efficiency.
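Before dropping anything, it can help to count and inspect the duplicates first. A small sketch on a toy DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob'],
    'Age': [25, 30, 30]
})

# Count fully duplicated rows (here, the second 'Bob' row)
print(df.duplicated().sum())  # 1

# Show every occurrence of a duplicated row for review
print(df[df.duplicated(keep=False)])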

  • Correcting inconsistent entries (e.g., spelling errors)
    • Fixing inconsistencies in data formatting or spelling (e.g., "NY", "ny", "New York"). Why is it important? (A mapping sketch follows this list.)
      • Ensures uniformity, especially in categorical features.
      • Helps correct values that should logically be grouped together.
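Note that lowercasing alone merges "ny" with "NY" but not with "New York", so an explicit mapping is often needed as well. A sketch with illustrative city aliases:

import pandas as pd

df = pd.DataFrame({'City': ['NY', 'ny', 'New York', 'Boston']})

# Normalize case and whitespace first, then map known aliases to one canonical name
df['City'] = df['City'].str.strip().str.lower()
df['City'] = df['City'].replace({'ny': 'new york'})
print(df['City'].unique())  # ['new york' 'boston']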

  • Outlier detection and treatment
    • Identifying and addressing extreme values that deviate significantly from the rest of the data. Why is it important?
      • Outliers can distort model training, especially in regression or clustering.
      • They may represent data entry errors or rare cases.

Example code:

Here’s a Python guide using Pandas, NumPy, and Scikit-learn to demonstrate Data Cleaning step-by-step with examples:

 

1. Handling Missing Values

Sample Data

import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 30, 22, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F']
}
df = pd.DataFrame(data)
print(df)

a. Imputation

Mean/Median Imputation

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Or use median(); chained inplace=True is deprecated in recent pandas

Mode Imputation for Categorical

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

b. Deletion

Drop rows with missing values

df.dropna(inplace=True)

Drop columns with too many NaNs

df.dropna(axis=1, thresh=3, inplace=True)  # Keep columns with >= 3 non-NaN

 

2. Removing Duplicates

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [25, 30, 30, 35]
})
df.drop_duplicates(inplace=True)

 

3. Correcting Inconsistent Entries

df = pd.DataFrame({
    'City': ['New York', 'new york', 'NEW YORK', 'Boston', 'boston']
})
 
# Standardize entries
df['City'] = df['City'].str.lower().str.strip()

 

4. Outlier Detection and Treatment

a. Z-Score Method (for normal distribution)

from scipy import stats

# Recreate a numeric sample (the df above only has a 'City' column)
df = pd.DataFrame({'Age': [22, 25, 30, 28, 35, 120]})
z_scores = np.abs(stats.zscore(df['Age']))
df = df[z_scores < 3]  # Keep rows where |z| < 3

b. IQR Method (for skewed data)

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
 
df = df[(df['Age'] >= Q1 - 1.5 * IQR) & (df['Age'] <= Q3 + 1.5 * IQR)]
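Both snippets above drop the outlier rows. An alternative treatment is to cap extreme values at the IQR fences instead of removing them; a minimal, self-contained sketch:

import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 28, 30, 35, 120]})
Q1, Q3 = df['Age'].quantile(0.25), df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Cap (winsorize) values at the IQR fences rather than dropping rows
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df['Age'] = df['Age'].clip(lower=lower, upper=upper)
print(df)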

 

Summary Table

| Task | Method | Function Used |
| --- | --- | --- |
| Missing values (numeric) | Mean/Median | df['col'].fillna(df['col'].mean()) |
| Missing values (categorical) | Mode | df['col'].fillna(df['col'].mode()[0]) |
| Deletion | Drop rows/columns | dropna() |
| Duplicates | Remove duplicates | drop_duplicates() |
| Inconsistencies | Standardize strings | str.lower().str.strip() |
| Outliers (normal) | Z-score | stats.zscore() |
| Outliers (non-normal) | IQR | quantile() + filtering logic |

 

Comprehensive summary of methods for handling missing data during the data cleaning step of data preprocessing:


| Method | Type | Purpose | When to Use | Mandatory? | Python Example |
| --- | --- | --- | --- | --- | --- |
| Deletion (drop rows) | Removal | Remove rows with missing values | When missing data is minimal and random | Sometimes | df.dropna() |
| Deletion (drop columns) | Removal | Remove entire columns with many missing values | When a column has >50–70% missing values, or is irrelevant | Sometimes | df.drop(columns=['col']) |
| Mean/Median Imputation | Imputation | Fill numeric missing values with the mean or median | When data is MCAR (Missing Completely At Random) | Yes (if needed) | df['col'].fillna(df['col'].mean()) |
| Mode Imputation | Imputation | Fill categorical missing values with the most frequent value | For categorical features with few missing values | Yes (if needed) | df['col'].fillna(df['col'].mode()[0]) |
| Forward/Backward Fill | Time-based | Propagate previous/next values | For time-series or ordered data with small gaps | Optional | df.ffill() |
| Custom/Constant Imputation | Imputation | Replace missing values with a fixed value | When you want to preserve a missing indicator, e.g., 'unknown' | Optional | df.fillna('unknown') |
| KNN Imputation | Advanced Imputation | Use similar data points to fill values | When feature relationships are important and data is not MCAR | Optional | from sklearn.impute import KNNImputer |
| Multivariate Imputation (MICE) | Advanced Imputation | Iteratively models each feature with missing values | When you want a more statistically sound imputation | Optional | from sklearn.experimental import enable_iterative_imputer |
| Drop if all missing | Removal | Drop rows/columns with all values missing | Safe default to clean totally empty rows/columns | Sometimes | df.dropna(how='all') |
| Missing Indicator Column | Feature Engineering | Add a binary column to track where values were missing | When you want the model to learn from the missingness pattern | Optional | SimpleImputer(add_indicator=True) |
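To make the last two advanced rows concrete, here is a minimal sketch of MICE-style imputation and the missing-indicator option in scikit-learn, assuming two illustrative numeric columns:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required before the import below
from sklearn.impute import IterativeImputer, SimpleImputer

X = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, 28],
    'Income': [50000, 62000, np.nan, 48000, 55000]
})

# MICE-style: each feature with gaps is iteratively modeled from the others
mice = IterativeImputer(random_state=0)
X_mice = mice.fit_transform(X)

# Mean imputation plus binary indicator columns marking where values were missing
imp = SimpleImputer(strategy='mean', add_indicator=True)
X_ind = imp.fit_transform(X)
print(X_ind.shape)  # (5, 4): 2 imputed features + 2 missing indicators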

 

Link to Data preprocessing Home

 
