
Thursday, August 7, 2025

Data Transformation and Feature Scaling in Data Preprocessing

Data transformation in data preprocessing: what is feature scaling, and why is it important?


Data Transformation / Feature Scaling

Scaling is mandatory for algorithms that are sensitive to feature scale.

Data Transformation – Data transformation refers to changing the format, structure, or distribution of your data to make it more suitable for analysis or modeling.

Feature Scaling – Feature scaling is a specific kind of data transformation that focuses on rescaling numeric features so they have similar ranges or distributions.
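To see why similar ranges matter, consider a distance-based model such as KNN. The sketch below (values and feature ranges are hypothetical, chosen only for illustration) shows how an unscaled feature with large units dominates the Euclidean distance:

```python
import numpy as np

# Two hypothetical samples: (age in years, salary in dollars)
a = np.array([25.0, 50000.0])
b = np.array([30.0, 52000.0])

# On raw units, the distance is dominated by the salary difference
raw_dist = np.linalg.norm(a - b)
print(raw_dist)  # ~2000.006: the 5-year age gap barely registers

# Min-max scale each feature, using assumed ranges:
# age 20-60, salary 20000-150000
def min_max(x, lo, hi):
    return (x - lo) / (hi - lo)

a_scaled = np.array([min_max(25, 20, 60), min_max(50000, 20000, 150000)])
b_scaled = np.array([min_max(30, 20, 60), min_max(52000, 20000, 150000)])

# After scaling, both features contribute on a comparable scale
scaled_dist = np.linalg.norm(a_scaled - b_scaled)
print(scaled_dist)
```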

The following are a few important data transformation / feature scaling methods:

  • Normalization (Min-Max Scaling): Scales values to the range 0 to 1. It does not handle outliers well, since a single extreme value compresses all other values into a narrow band.

    • When to use?
      • When features are on a different scale
      • Required for algorithms like KNN, SVM, Neural Networks

    • Formula:

X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}

Python Example:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = MinMaxScaler()
df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)
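As a small follow-up sketch: a fitted MinMaxScaler can map scaled values back to the original units with inverse_transform, which is handy when you need predictions in the original scale after modeling.

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[['Salary']])

# Map the 0-1 values back to the original salary units
restored = scaler.inverse_transform(scaled)
print(restored.ravel())
```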


  • Standardization (Z-score Scaling): Scales values to have mean 0 and standard deviation 1. Z-scores are also useful for flagging outliers: values with a large absolute z-score (e.g. |z| > 3) are often treated as outliers.

    • When to use?
      • When data should have mean = 0 and standard deviation = 1
      • Needed for linear regression, logistic regression, PCA
    • Formula:


X' = \frac{X - \mu}{\sigma}

Python Example:

from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = StandardScaler()
df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)

 

  • Log Transformation: For skewed distributions.

    • When to use?
      • To reduce right skewness in the data. (Right skewness, also called positive skew, describes a distribution where most values are concentrated on the left and the tail extends to the right.)
      • Helps stabilize variance.

Python Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})

df['Log_Income'] = np.log(df['Income'])
print(df)
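One caveat worth noting: np.log is undefined at 0 (it returns -inf), so for columns that may contain zeros, np.log1p, which computes log(1 + x), is a common alternative. A small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Income data that includes a zero, which np.log cannot handle
df = pd.DataFrame({'Income': [0, 1000, 10000, 100000]})

# log1p(x) = log(1 + x) is finite at 0 and behaves like log for large x
df['Log1p_Income'] = np.log1p(df['Income'])
print(df)
```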

 

  • Box-Cox Transformation: An advanced transformation for skewed distributions. It works only with strictly positive values.

Python Example:

from scipy.stats import boxcox
import pandas as pd

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})

# Apply Box-Cox transformation (the second return value is the fitted lambda)
df['BoxCox_Sales'], _ = boxcox(df['Sales'])
print(df)
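Because Box-Cox requires strictly positive inputs, data with zeros or negative values needs a different tool. One option (a sketch with hypothetical data, not part of the original example) is scikit-learn's PowerTransformer with method='yeo-johnson', a closely related transformation that accepts any real values:

```python
from sklearn.preprocessing import PowerTransformer
import pandas as pd

# Data with negative and zero values, which Box-Cox would reject
df = pd.DataFrame({'Profit': [-50, 0, 10, 50, 100]})

# Yeo-Johnson handles all real values; standardize=True (the default)
# also rescales the output to mean 0 and unit variance
pt = PowerTransformer(method='yeo-johnson')
df['YeoJohnson_Profit'] = pt.fit_transform(df[['Profit']])
print(df)
```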

 

  • Robust Scaling: Uses the median and interquartile range (IQR) instead of the mean and standard deviation, so extreme values have little influence on the result.

    • When to use?
      • When data contains outliers.
    • Formula:

X' = \frac{X - \text{median}}{\text{IQR}}

Python Example:

from sklearn.preprocessing import RobustScaler
import pandas as pd

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

scaler = RobustScaler()
df['Robust_Score'] = scaler.fit_transform(df[['Score']])
print(df)
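A quick side-by-side comparison (an illustrative sketch) shows the difference on the same data: the outlier inflates StandardScaler's mean and standard deviation, while RobustScaler centers the inliers near 0 (the median) and lets the outlier stand clearly apart:

```python
from sklearn.preprocessing import StandardScaler, RobustScaler
import pandas as pd

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

# StandardScaler: mean and std are both dragged up by the value 100
df['Standard'] = StandardScaler().fit_transform(df[['Score']])

# RobustScaler: centered on the median, scaled by the IQR
df['Robust'] = RobustScaler().fit_transform(df[['Score']])
print(df)
```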

 

 

Link to Data Preprocessing Home

