
Thursday, August 7, 2025

Data Transformation and Feature Scaling in Data Preprocessing

Data transformation in data preprocessing: what is feature scaling, and why is it important?


Data Transformation / Feature Scaling

Scaling is mandatory for algorithms that are sensitive to feature scale.

Data Transformation – Data transformation refers to changing the format, structure, or distribution of your data to make it more suitable for analysis or modeling.

Feature Scaling – Feature scaling is a specific kind of data transformation that focuses on rescaling numeric features so they have similar ranges or distributions.
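To see why similar ranges matter, consider a distance-based model such as KNN. The sketch below (values and feature ranges are hypothetical, chosen only for illustration) shows how an unscaled feature with large units dominates the Euclidean distance:

```python
import numpy as np

# Two hypothetical samples: (age in years, salary in dollars)
a = np.array([25.0, 50000.0])
b = np.array([30.0, 52000.0])

# On raw units, the distance is dominated by the salary difference
raw_dist = np.linalg.norm(a - b)
print(raw_dist)  # ~2000.006: the 5-year age gap barely registers

# Min-max scale each feature, using assumed ranges:
# age 20-60, salary 20000-150000
def min_max(x, lo, hi):
    return (x - lo) / (hi - lo)

a_scaled = np.array([min_max(25, 20, 60), min_max(50000, 20000, 150000)])
b_scaled = np.array([min_max(30, 20, 60), min_max(52000, 20000, 150000)])

# After scaling, both features contribute on a comparable scale
scaled_dist = np.linalg.norm(a_scaled - b_scaled)
print(scaled_dist)
```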

The following are a few important data transformation / feature scaling methods:

  • Normalization (Min-Max Scaling): Scales values to the range 0 to 1. It does not handle outliers well, since a single extreme value compresses all other values into a narrow band.

    • When to use?
      • When features are on a different scale
      • Required for algorithms like KNN, SVM, Neural Networks

    • Formula:

X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}

Python Example:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = MinMaxScaler()
df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)
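As a small follow-up sketch: a fitted MinMaxScaler can map scaled values back to the original units with inverse_transform, which is handy when you need predictions in the original scale after modeling.

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[['Salary']])

# Map the 0-1 values back to the original salary units
restored = scaler.inverse_transform(scaled)
print(restored.ravel())
```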


  • Standardization (Z-score Scaling): Scales values to have mean 0 and standard deviation 1. Z-scores are also useful for flagging outliers: values with a large absolute z-score (e.g. |z| > 3) are often treated as outliers.

    • When to use?
      • When data should have mean = 0 and standard deviation = 1
      • Needed for linear regression, logistic regression, PCA
    • Formula:


X' = \frac{X - \mu}{\sigma}

Python Example:

from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = StandardScaler()
df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)

 

  • Log Transformation: For skewed distributions.

    • When to use?
      • To reduce right skewness in the data. (Right skewness, also called positive skew, describes a distribution where most values are concentrated on the left and the tail extends to the right.)
      • Helps stabilize variance.

Python Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})

df['Log_Income'] = np.log(df['Income'])
print(df)
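One caveat worth noting: np.log is undefined at 0 (it returns -inf), so for columns that may contain zeros, np.log1p, which computes log(1 + x), is a common alternative. A small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Income data that includes a zero, which np.log cannot handle
df = pd.DataFrame({'Income': [0, 1000, 10000, 100000]})

# log1p(x) = log(1 + x) is finite at 0 and behaves like log for large x
df['Log1p_Income'] = np.log1p(df['Income'])
print(df)
```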

 

  • Box-Cox Transformation: An advanced transformation for skewed distributions. It works only with strictly positive values.

Python Example:

from scipy.stats import boxcox
import pandas as pd

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})

# Apply Box-Cox transformation (the second return value is the fitted lambda)
df['BoxCox_Sales'], _ = boxcox(df['Sales'])
print(df)
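Because Box-Cox requires strictly positive inputs, data with zeros or negative values needs a different tool. One option (a sketch with hypothetical data, not part of the original example) is scikit-learn's PowerTransformer with method='yeo-johnson', a closely related transformation that accepts any real values:

```python
from sklearn.preprocessing import PowerTransformer
import pandas as pd

# Data with negative and zero values, which Box-Cox would reject
df = pd.DataFrame({'Profit': [-50, 0, 10, 50, 100]})

# Yeo-Johnson handles all real values; standardize=True (the default)
# also rescales the output to mean 0 and unit variance
pt = PowerTransformer(method='yeo-johnson')
df['YeoJohnson_Profit'] = pt.fit_transform(df[['Profit']])
print(df)
```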

 

  • Robust Scaling: Uses the median and interquartile range (IQR) instead of the mean and standard deviation, so extreme values have little influence on the result.

    • When to use?
      • When data contains outliers.
    • Formula:

X' = \frac{X - \text{median}}{\text{IQR}}

Python Example:

from sklearn.preprocessing import RobustScaler
import pandas as pd

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

scaler = RobustScaler()
df['Robust_Score'] = scaler.fit_transform(df[['Score']])
print(df)
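A quick side-by-side comparison (an illustrative sketch) shows the difference on the same data: the outlier inflates StandardScaler's mean and standard deviation, while RobustScaler centers the inliers near 0 (the median) and lets the outlier stand clearly apart:

```python
from sklearn.preprocessing import StandardScaler, RobustScaler
import pandas as pd

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

# StandardScaler: mean and std are both dragged up by the value 100
df['Standard'] = StandardScaler().fit_transform(df[['Score']])

# RobustScaler: centered on the median, scaled by the IQR
df['Robust'] = RobustScaler().fit_transform(df[['Score']])
print(df)
```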

 

 

Link to Data Preprocessing Home

