Thursday, August 7, 2025

Data cleaning in data preprocessing

Data preprocessing in machine learning and data science: different methods of data cleaning


Data Cleaning

Mandatory when data contains noise, errors, or missing values.

Data cleaning is a critical step in the data preprocessing phase of any data analysis, machine learning, or data science project. It involves detecting and correcting (or removing) errors and inconsistencies in the data to improve its quality.

Purpose of Data Cleaning

  • Ensure accuracy and reliability of data
  • Improve model performance
  • Reduce noise and bias
  • Avoid misleading analysis or decisions

The following are the key tasks in data cleaning:

  • Handling missing values:
    • Imputation (mean/median/mode, KNN, regression): replacing missing data in a dataset with estimated values so that machine learning models can be trained without errors. The table below summarizes the common imputation types; a short KNN sketch follows it.

| Type | Description | Suitable For |
| --- | --- | --- |
| Mean Imputation | Replace the missing value with the mean of the column | Numerical data |
| Median Imputation | Replace the missing value with the median | Numerical data (with outliers) |
| Mode Imputation | Replace with the most frequent value | Categorical data |
| Constant Imputation | Replace with a specific constant (e.g., "Unknown", 0) | Both types |
| Forward/Backward Fill | Use the previous or next value (time series) | Time series data |
| KNN Imputation | Use nearest neighbors to estimate the missing value | Mixed-type features |
| Model-Based Imputation | Use ML models (e.g., regression) to predict missing values | Complex datasets |
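Of the advanced entries above, KNN imputation is usually the first one reached for. A minimal sketch using scikit-learn's KNNImputer, assuming two illustrative numeric columns ('Age' and 'Income'):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Income': [50000, 62000, np.nan, 48000, 55000]
})

# Each missing value is estimated from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Income']] = imputer.fit_transform(df[['Age', 'Income']])
print(df)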

  • Deletion (row or column)
    • Deleting rows from the dataset. When to use row deletion?
      • The number of missing rows is small, and deleting them won’t hurt the dataset.
      • The missing values are completely random.
      • You don’t want to introduce assumptions via imputation.
    • Deleting columns (features) from the dataset. When to use column deletion? (See the sketch after this list.)
      • A column has too many missing values (e.g., >50% missing).
      • The column is not informative, or it is redundant.
      • You plan to drop high-cardinality columns that are hard to encode.
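Both deletion strategies in a minimal sketch, assuming a generic DataFrame df and using the >50% rule of thumb from above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, np.nan, np.nan, 1],  # 75% missing
    'C': ['x', 'y', 'z', 'w']
})

# Row deletion: drop any row that still contains a missing value
rows_cleaned = df.dropna()

# Column deletion: keep only columns where at most 50% of values are missing
cols_cleaned = df.loc[:, df.isna().mean() <= 0.5]
print(cols_cleaned)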

  • Removing duplicates
    • Removing duplicate rows that repeat the same information in the dataset. Why is it important? (An inspection sketch follows this list.)
      • Prevents the model from being biased by repeated records.
      • Reduces dataset size and improves model efficiency.
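Before dropping anything, it can help to count and inspect the duplicates first. A small sketch on a toy DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob'],
    'Age': [25, 30, 30]
})

# Count fully duplicated rows (here, the second 'Bob' row)
print(df.duplicated().sum())  # 1

# Show every occurrence of a duplicated row for review
print(df[df.duplicated(keep=False)])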

  • Correcting inconsistent entries (e.g., spelling errors)
    • Fixing inconsistencies in data formatting or spelling (e.g., "NY", "ny", "New York"). Why is it important? (A mapping sketch follows this list.)
      • Ensures uniformity, especially in categorical features.
      • Helps correct values that should logically be grouped together.
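Note that lowercasing alone merges "ny" with "NY" but not with "New York", so an explicit mapping is often needed as well. A sketch with illustrative city aliases:

import pandas as pd

df = pd.DataFrame({'City': ['NY', 'ny', 'New York', 'Boston']})

# Normalize case and whitespace first, then map known aliases to one canonical name
df['City'] = df['City'].str.strip().str.lower()
df['City'] = df['City'].replace({'ny': 'new york'})
print(df['City'].unique())  # ['new york' 'boston']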

  • Outlier detection and treatment
    • Identifying and addressing extreme values that deviate significantly from the rest of the data. Why is it important?
      • Outliers can distort model training, especially in regression or clustering.
      • They may represent data entry errors or rare cases.

Example code:

Here’s a Python guide using Pandas, NumPy, and Scikit-learn to demonstrate Data Cleaning step-by-step with examples:

 

1. Handling Missing Values

Sample Data

import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 30, 22, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F']
}
df = pd.DataFrame(data)
print(df)

a. Imputation

Mean/Median Imputation

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Or use median(); chained inplace=True is deprecated in recent pandas

Mode Imputation for Categorical

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

b. Deletion

Drop rows with missing values

df.dropna(inplace=True)

Drop columns with too many NaNs

df.dropna(axis=1, thresh=3, inplace=True)  # Keep columns with >= 3 non-NaN

 

2. Removing Duplicates

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [25, 30, 30, 35]
})
df.drop_duplicates(inplace=True)

 

3. Correcting Inconsistent Entries

df = pd.DataFrame({
    'City': ['New York', 'new york', 'NEW YORK', 'Boston', 'boston']
})
 
# Standardize entries
df['City'] = df['City'].str.lower().str.strip()

 

4. Outlier Detection and Treatment

a. Z-Score Method (for normal distribution)

from scipy import stats

# Recreate a numeric sample (the df above only has a 'City' column)
df = pd.DataFrame({'Age': [22, 25, 30, 28, 35, 120]})
z_scores = np.abs(stats.zscore(df['Age']))
df = df[z_scores < 3]  # Keep rows where |z| < 3

b. IQR Method (for skewed data)

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
 
df = df[(df['Age'] >= Q1 - 1.5 * IQR) & (df['Age'] <= Q3 + 1.5 * IQR)]
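Both snippets above drop the outlier rows. An alternative treatment is to cap extreme values at the IQR fences instead of removing them; a minimal, self-contained sketch:

import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 28, 30, 35, 120]})
Q1, Q3 = df['Age'].quantile(0.25), df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Cap (winsorize) values at the IQR fences rather than dropping rows
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df['Age'] = df['Age'].clip(lower=lower, upper=upper)
print(df)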

 

Summary Table

| Task | Method | Function Used |
| --- | --- | --- |
| Missing values (numeric) | Mean/Median | df['col'].fillna(df['col'].mean()) |
| Missing values (categorical) | Mode | df['col'].fillna(df['col'].mode()[0]) |
| Deletion | Drop rows/columns | dropna() |
| Duplicates | Remove duplicates | drop_duplicates() |
| Inconsistencies | Standardize strings | str.lower().str.strip() |
| Outliers (normal) | Z-score | stats.zscore() |
| Outliers (non-normal) | IQR | quantile() + filtering logic |

 

Comprehensive summary of methods for handling missing data during the data cleaning step of data preprocessing:


| Method | Type | Purpose | When to Use | Mandatory? | Python Example |
| --- | --- | --- | --- | --- | --- |
| Deletion (drop rows) | Removal | Remove rows with missing values | When missing data is minimal and random | Sometimes | df.dropna() |
| Deletion (drop columns) | Removal | Remove entire columns with many missing values | When a column has >50–70% missing values, or is irrelevant | Sometimes | df.drop(columns=['col']) |
| Mean/Median Imputation | Imputation | Fill numeric missing values with the mean or median | When data is MCAR (Missing Completely At Random) | Yes (if needed) | df['col'].fillna(df['col'].mean()) |
| Mode Imputation | Imputation | Fill categorical missing values with the most frequent value | For categorical features with few missing values | Yes (if needed) | df['col'].fillna(df['col'].mode()[0]) |
| Forward/Backward Fill | Time-based | Propagate previous/next values | For time-series or ordered data with small gaps | Optional | df.ffill() |
| Custom/Constant Imputation | Imputation | Replace missing values with a fixed value | When you want to preserve a missing indicator, e.g., 'unknown' | Optional | df.fillna('unknown') |
| KNN Imputation | Advanced Imputation | Use similar data points to fill values | When feature relationships are important and data is not MCAR | Optional | from sklearn.impute import KNNImputer |
| Multivariate Imputation (MICE) | Advanced Imputation | Iteratively models each feature with missing values | When you want a more statistically sound imputation | Optional | from sklearn.experimental import enable_iterative_imputer |
| Drop if all missing | Removal | Drop rows/columns with all values missing | Safe default to clean totally empty rows/columns | Sometimes | df.dropna(how='all') |
| Missing Indicator Column | Feature Engineering | Add a binary column to track where values were missing | When you want the model to learn from the missingness pattern | Optional | SimpleImputer(add_indicator=True) |
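To make the last two advanced rows concrete, here is a minimal sketch of MICE-style imputation and the missing-indicator option in scikit-learn, assuming two illustrative numeric columns:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required before the import below
from sklearn.impute import IterativeImputer, SimpleImputer

X = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, 28],
    'Income': [50000, 62000, np.nan, 48000, 55000]
})

# MICE-style: each feature with gaps is iteratively modeled from the others
mice = IterativeImputer(random_state=0)
X_mice = mice.fit_transform(X)

# Mean imputation plus binary indicator columns marking where values were missing
imp = SimpleImputer(strategy='mean', add_indicator=True)
X_ind = imp.fit_transform(X)
print(X_ind.shape)  # (5, 4): 2 imputed features + 2 missing indicators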

 

Link to Data preprocessing Home

 
