Computer Science and Engineering - Tutorials, Notes, MCQs, Questions and Answers: Data Transformation and Feature Scaling in Data Preprocessing

Thursday, August 7, 2025

Data Transformation and Feature Scaling in Data Preprocessing

What is data transformation in data preprocessing, what is feature scaling, and why is it important?

Data Transformation / Feature Scaling

Feature scaling is essential for algorithms that are sensitive to feature scale, such as those based on distances or gradient descent.

Data Transformation – Data transformation refers to changing the format, structure, or distribution of your data to make it more suitable for analysis or modeling.

Feature Scaling – Feature scaling is a specific kind of data transformation that focuses on rescaling numeric features so they have similar ranges or distributions.
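To make the motivation concrete, here is a minimal sketch in plain Python of how an unscaled feature can dominate a Euclidean distance, the quantity that distance-based algorithms rely on. The four people, their ages and salaries, and the helper `min_max` are made-up illustrations, not data from this post.

```python
import math

# Hypothetical data: (age in years, salary) for four people.
people = [(25, 50000), (60, 51000), (27, 53000), (45, 90000)]

def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = min_max([p[0] for p in people])
salaries = min_max([p[1] for p in people])
scaled = list(zip(ages, salaries))

a, b, c = people[0], people[1], people[2]
sa, sb, sc = scaled[0], scaled[1], scaled[2]

# Raw units: salary differences swamp the 35-year age gap between a and b.
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

# Scaled units: both features contribute on a comparable footing.
scaled_ab, scaled_ac = math.dist(sa, sb), math.dist(sa, sc)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, person b (35 years older) appears closer to a than person c (2 years older), purely because salary is measured in larger numbers. After scaling, the ordering flips.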

The following are some of the important data transformation and feature scaling methods:

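Normalization (Min-Max Scaling): Rescales values to the range 0 to 1. It is sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow band.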

When to use?

When features are on different scales and a bounded range is wanted
Commonly used with scale-sensitive algorithms such as KNN, SVM, and neural networks

Formula:

$X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$

Python Example:

from sklearn.preprocessing import MinMaxScaler

import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = MinMaxScaler()

df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])

print(df)
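As a cross-check on the formula, the following sketch applies min-max scaling by hand in plain Python to the same salary values. The helper `min_max` and the extra salary of 1,000,000 are assumptions added for illustration; the second call shows how one extreme value squeezes the others together.

```python
salaries = [20000, 50000, 100000, 150000]

def min_max(values):
    """Apply X' = (X - X_min) / (X_max - X_min) to each value."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max(salaries)
print([round(v, 4) for v in normalized])

# Assumed extra data point: one very large salary added to the same column.
with_outlier = min_max(salaries + [1_000_000])
print([round(v, 4) for v in with_outlier])
```

With the extra value included, the four original salaries all land in roughly the bottom 13% of the 0 to 1 range.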

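Standardization (Z-score Scaling): Rescales values to have mean 0 and standard deviation 1. It does not bound values to a fixed range. Large absolute z-scores can be used to flag outliers, but the mean and standard deviation are themselves affected by them.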

When to use?

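When features measured in different units should be centered at 0 and given unit variance so they are comparable
Commonly used with linear regression and logistic regression (particularly when regularized or trained with gradient descent) and with PCA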

Formula:

$X' = \frac{X - \mu}{\sigma}$

Python Example:

import pandas as pd

from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = StandardScaler()

# The scaler expects a 2-D input, hence the double brackets

df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])

print(df)
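The formula can also be checked by hand. The sketch below uses only Python's `statistics` module on the same salary values. One detail worth knowing: `StandardScaler` divides by the population standard deviation (dividing by n), while pandas' `Series.std()` defaults to the sample version (dividing by n - 1), so the two give slightly different z-scores on small data.

```python
import statistics

salaries = [20000, 50000, 100000, 150000]

mu = statistics.fmean(salaries)
# Population standard deviation (divide by n), which is what StandardScaler uses.
sigma = statistics.pstdev(salaries)

z_scores = [(x - mu) / sigma for x in salaries]
print([round(z, 4) for z in z_scores])

# Sample standard deviation (divide by n - 1), the default of pandas' Series.std().
sample_sigma = statistics.stdev(salaries)
print(round(sigma, 2), round(sample_sigma, 2))
```

The resulting z-scores have mean 0 and population standard deviation 1, which is exactly what standardization promises.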

Log Transformation: Compresses large values and is used for right-skewed distributions. The logarithm is defined only for positive values.

When to use?

To reduce right skewness in the data. Right skewness (also called positive skew) describes a distribution where most values are concentrated on the left and the tail extends to the right.
To help stabilize variance.

Python Example:

import numpy as np

import pandas as pd

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})

# Natural log; every value must be positive

df['Log_Income'] = np.log(df['Income'])

print(df)
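To see why the log helps with a long right tail, this plain-Python sketch shows that each tenfold jump in income becomes the same additive step, and that the transform can be undone with the exponential. The list `with_zero` is a hypothetical example added here to illustrate the common log(1 + x) workaround for data containing zeros (`np.log1p` is the NumPy counterpart of `math.log1p`).

```python
import math

incomes = [1000, 10000, 100000, 1000000]

log_incomes = [math.log(x) for x in incomes]  # natural log, like np.log

# Each tenfold jump in income becomes the same additive step, ln(10).
steps = [b - a for a, b in zip(log_incomes, log_incomes[1:])]
print([round(s, 4) for s in steps])

# The transform is reversible with the exponential function.
recovered = [math.exp(v) for v in log_incomes]

# log(0) is undefined; log(1 + x) is a common workaround when zeros occur.
with_zero = [0, 9, 99]  # hypothetical values including a zero
log1p_values = [math.log1p(x) for x in with_zero]
print([round(v, 4) for v in log1p_values])
```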

Box-Cox Transformation: A power transformation for skewed distributions, controlled by a parameter λ that is estimated from the data. Works only with strictly positive values.

Python Example:

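import pandas as pd

from scipy.stats import boxcox

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})

# Apply Box-Cox transformation; boxcox also returns the fitted lambda

df['BoxCox_Sales'], fitted_lambda = boxcox(df['Sales'])

print(df)

print('Fitted lambda:', fitted_lambda)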
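For reference, the Box-Cox transform for a given λ is $y = \frac{x^{\lambda} - 1}{\lambda}$ when $\lambda \neq 0$ and $y = \ln x$ when $\lambda = 0$. The sketch below implements that definition and its inverse in plain Python. The function names and the value `lam = 0.5` are illustrative assumptions; `scipy.stats.boxcox` instead estimates λ from the data when none is supplied.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a single positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def inverse_box_cox(y, lam):
    """Undo box_cox for the same lambda."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1) ** (1 / lam)

sales = [1, 5, 10, 50, 100]
lam = 0.5  # arbitrary illustrative value, not a fitted one

transformed = [box_cox(x, lam) for x in sales]
recovered = [inverse_box_cox(y, lam) for y in transformed]
print([round(v, 4) for v in transformed])
```

Setting λ to 0 reduces Box-Cox to the log transformation, and λ = 1 only shifts the data by 1, so the fitted λ indicates how strong a correction the data needed.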

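Robust Scaling: Subtracts the median and divides by the interquartile range (IQR), so extreme values have little influence on the scaling.

When to use?

When data contains outliers.
When the median and IQR are preferred over the mean and standard deviation, which a few extreme values can distort.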

Python Example:

import pandas as pd

from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

# By default, centers on the median and scales by the IQR

scaler = RobustScaler()

df['Robust_Score'] = scaler.fit_transform(df[['Score']])

print(df)
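The same computation can be sketched by hand as $X' = \frac{X - \text{median}}{\text{IQR}}$, using Python's `statistics` module on the score values above. The `method="inclusive"` setting interpolates linearly between data points, which is assumed here to correspond to the quartiles that `RobustScaler` computes by default.

```python
import statistics

scores = [1, 2, 3, 100]

median = statistics.median(scores)
# method="inclusive" interpolates linearly between data points.
q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

robust = [(x - median) / iqr for x in scores]
print(median, q1, q3, iqr)
print([round(v, 4) for v in robust])
```

The three ordinary scores stay close to 0 while the value 100 remains clearly separated, so the outlier stays visible without dictating the scale of the other points.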
