Data Preprocessing in Machine Learning and Data Science: Different Methods of Data Cleaning
Data Cleaning
Mandatory when data contains noise, errors, or missing values.
Data cleaning is a critical step in the data preprocessing phase of any data analysis, machine learning, or data science project. It involves detecting and correcting (or removing) errors and inconsistencies in the data to improve its quality.
Purpose of Data Cleaning
- Ensure accuracy and reliability of data
- Improve model performance
- Reduce noise and bias
- Avoid misleading analysis or decisions
The key tasks in data cleaning are:
- Handling missing values
  - Imputation (mean/median/mode, KNN, regression): the process of replacing missing data in a dataset with estimated values so that machine learning models can be trained effectively without errors. Common imputation methods are summarized below.
| Type | Description | Suitable For |
| --- | --- | --- |
| Mean Imputation | Replace missing value with the mean of the column | Numerical data |
| Median Imputation | Replace missing value with the median | Numerical data (with outliers) |
| Mode Imputation | Replace with the most frequent value | Categorical data |
| Constant Imputation | Replace with a specific constant (e.g., "Unknown", 0) | Both types |
| Forward/Backward Fill | Use the previous or next value (time series) | Time series data |
| KNN Imputation | Use nearest neighbors to estimate the missing value | Mixed-type features |
| Model-Based Imputation | Use ML models (e.g., regression) to predict missing values | Complex datasets |
  - Deletion (row or column)
    - Deleting rows from the dataset. When to use row deletion?
      - The number of missing rows is small, and deleting them won't hurt the dataset.
      - The missing values are completely random.
      - You don't want to introduce assumptions via imputation.
    - Deleting columns (features) from the dataset. When to use column deletion?
      - A column has too many missing values (e.g., >50% missing).
      - The column is not informative or is redundant.
      - You plan to drop high-cardinality columns that are hard to encode.
- Removing duplicates
  - Removing duplicate rows that repeat the same information in the dataset. Why it's important:
    - Prevents the model from being biased by repeated records.
    - Reduces dataset size and improves model efficiency.
- Correcting inconsistent entries (e.g., spelling errors)
  - Fixing inconsistencies in data formatting or spelling (e.g., "NY", "ny", "New York"). Why it's important:
    - Ensures uniformity, especially in categorical features.
    - Helps correct values that should logically be grouped together.
- Outlier detection and treatment
  - Identifying and addressing extreme values that deviate significantly from the rest of the data. Why it's important:
    - Outliers can distort model training, especially in regression or clustering.
    - Outliers may represent data entry errors or rare cases.
Example code:
Here’s a Python guide using Pandas, NumPy, and Scikit-learn to demonstrate Data Cleaning step-by-step with examples:
1. Handling Missing Values
Sample Data
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, np.nan, 30, 22, np.nan],
'Gender': ['F', 'M', np.nan, 'M', 'F']
}
df = pd.DataFrame(data)
print(df)
a. Imputation
Mean/Median Imputation
df['Age'] = df['Age'].fillna(df['Age'].mean())  # or use median(); chained inplace fillna is deprecated in recent pandas
Mode Imputation for Categorical
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
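The concepts table above also lists KNN imputation. Here is a minimal sketch using scikit-learn's KNNImputer; the df_knn frame and its Salary column are illustrative names, and the example assumes all imputed columns are numeric:
from sklearn.impute import KNNImputer
# Hypothetical numeric sample; KNNImputer works on numeric features only
df_knn = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 45000, 52000]
})
imputer = KNNImputer(n_neighbors=2)  # estimate each gap from the 2 most similar rows
df_knn = pd.DataFrame(imputer.fit_transform(df_knn), columns=df_knn.columns)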
b. Deletion
Drop rows with missing values
df.dropna(inplace=True)
Drop columns with too many NaNs
df.dropna(axis=1, thresh=3, inplace=True) # Keep columns with >= 3 non-NaN
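To apply the ">50% missing" rule from the column-deletion discussion above, one possible sketch (the 0.5 threshold is an assumption, not a fixed rule):
# Drop columns where more than half the values are missing
df = df.loc[:, df.isna().mean() <= 0.5]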
2. Removing Duplicates
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
'Age': [25, 30, 30, 35]
})
df.drop_duplicates(inplace=True)
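drop_duplicates also accepts subset and keep parameters; for example, to treat rows with the same Name as duplicates and keep only the first occurrence:
df.drop_duplicates(subset=['Name'], keep='first', inplace=True)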
3. Correcting Inconsistent Entries
df = pd.DataFrame({
'City': ['New York', 'new york', 'NEW YORK', 'Boston', 'boston']
})
# Standardize entries
df['City'] = df['City'].str.lower().str.strip()
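Lowercasing handles case variants, but abbreviations still need an explicit mapping. A small sketch (the mapping itself is illustrative):
# Map known abbreviations/variants to one canonical value
df['City'] = df['City'].replace({'ny': 'new york', 'nyc': 'new york'})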
4. Outlier Detection and Treatment
a. Z-Score Method (for normally distributed data)
from scipy import stats
# Sample numeric data (the City frame above has no 'Age' column)
df = pd.DataFrame({'Age': [22, 25, 27, 30, 95]})
z_scores = np.abs(stats.zscore(df['Age']))
df = df[z_scores < 3]  # keep rows where |z| < 3
b. IQR Method (for skewed data)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Age'] >= Q1 - 1.5 * IQR) & (df['Age'] <= Q3 + 1.5 * IQR)]
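Dropping rows is not the only treatment: outliers can instead be capped at the IQR fences (winsorizing-style). A sketch reusing Q1, Q3, and IQR from above:
# Cap values at the IQR fences instead of removing the rows
df['Age'] = df['Age'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)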
Summary Table

| Task | Method | Function Used |
| --- | --- | --- |
| Missing values (numeric) | Mean/Median | df['col'].fillna(df['col'].mean()) |
| Missing values (categorical) | Mode | df['col'].fillna(df['col'].mode()[0]) |
| Deletion | Drop rows/columns | df.dropna() |
| Duplicates | Remove duplicates | df.drop_duplicates() |
| Inconsistencies | Standardize strings | str.lower(), str.strip() |
| Outliers (normal) | Z-score | scipy.stats.zscore() |
| Outliers (non-normal) | IQR | quantile() and boolean filtering |
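Putting the steps in the summary table together, here is a minimal end-to-end sketch. basic_clean, num_cols, and cat_cols are hypothetical names, and the median/mode and IQR choices are assumptions rather than the only options:
def basic_clean(df, num_cols, cat_cols):
    df = df.copy()
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    # Impute: median for numeric, mode for categorical
    for c in num_cols:
        df[c] = df[c].fillna(df[c].median())
    for c in cat_cols:
        df[c] = df[c].fillna(df[c].mode()[0]).str.lower().str.strip()
    # Filter outliers with the IQR rule
    for c in num_cols:
        q1, q3 = df[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[(df[c] >= q1 - 1.5 * iqr) & (df[c] <= q3 + 1.5 * iqr)]
    return df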
Below is a comprehensive summary of the methods used to handle missing data during the data cleaning step of data preprocessing:
| Method | Type | Purpose | When to Use | Mandatory? | Python Example |
| --- | --- | --- | --- | --- | --- |
| Deletion (drop rows) | Removal | Remove rows with missing values | When missing data is minimal and random | Sometimes | df.dropna() |
| Deletion (drop columns) | Removal | Remove entire columns with many missing values | When a column has >50–70% missing values, or is irrelevant | Sometimes | df.drop(columns=['col']) |
| Mean/Median Imputation | Imputation | Fill numeric missing values with mean or median | When data is MCAR (Missing Completely At Random) | Yes (if needed) | df['col'].fillna(df['col'].mean()) |
| Mode Imputation | Imputation | Fill categorical missing values with the most frequent value | For categorical features with few missing values | Yes (if needed) | df['col'].fillna(df['col'].mode()[0]) |
| Forward/Backward Fill | Time-based | Propagate previous/next values | For time-series or ordered data with small gaps | Optional | df.ffill() |
| Custom/Constant Imputation | Imputation | Replace missing values with a fixed value | When you want to preserve a missing indicator, e.g., 'unknown' | Optional | df.fillna('unknown') |
| KNN Imputation | Advanced imputation | Use similar data points to fill values | When feature relationships are important and data is not MCAR | Optional | from sklearn.impute import KNNImputer |
| Multivariate Imputation (MICE) | Advanced imputation | Iteratively models each feature with missing values | When you want a more statistically sound imputation | Optional | from sklearn.experimental import enable_iterative_imputer |
| Drop if all missing | Removal | Drop rows/columns with all values missing | Safe default to clean totally empty rows/columns | Sometimes | df.dropna(how='all') |
| Missing Indicator Column | Feature engineering | Add a binary column to track where values were missing | When you want the model to learn from the missingness pattern | Optional | SimpleImputer(add_indicator=True) |
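The last three rows above show only the relevant imports. A minimal sketch of iterative (MICE-style) imputation and a missing-indicator column, assuming purely numeric data (X and its columns are illustrative):
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # must precede the IterativeImputer import
from sklearn.impute import IterativeImputer, SimpleImputer

X = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0], 'b': [10.0, np.nan, 30.0, 40.0]})

# MICE-style: each feature with gaps is regressed on the others, iteratively
mice = IterativeImputer(max_iter=10, random_state=0)
X_mice = pd.DataFrame(mice.fit_transform(X), columns=X.columns)

# add_indicator=True appends one binary "was missing" column per imputed feature
si = SimpleImputer(strategy='mean', add_indicator=True)
X_ind = si.fit_transform(X)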
Link to Data preprocessing Home