Thursday, August 7, 2025

Data Transformation and Feature Scaling in Data Preprocessing

What is data transformation in data preprocessing, what is feature scaling, and why is it important?


Data Transformation / Feature Scaling

Scaling is mandatory for algorithms that are sensitive to the scale of the input features.

Data Transformation – Data transformation refers to changing the format, structure, or distribution of your data to make it more suitable for analysis or modeling.

Feature Scaling – Feature scaling is a specific kind of data transformation that focuses on rescaling numeric features so they have similar ranges or distributions.
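To see why scaling matters, consider a minimal sketch (with illustrative Age and Salary values) of the Euclidean distance used by algorithms like KNN; the feature with the largest units dominates the result:

import numpy as np

# Two people described as [Age, Salary] in raw units (illustrative values)
a = np.array([25, 20000])
b = np.array([45, 21000])

# Distance = sqrt(20^2 + 1000^2) ~= 1000.2, so the 20-year age gap
# contributes almost nothing next to the salary gap
print(np.linalg.norm(a - b))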

The following are a few of the important data transformation / feature scaling methods:

  • Normalization (Min-Max Scaling): Scales values to the range 0 to 1. It does not handle outliers well, since a single extreme value compresses all other values into a narrow band.

    • When to use?
      • When features are on different scales
      • Required for algorithms like KNN, SVM, and Neural Networks

    • Formula:

X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}

Python Example:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

# Rescale 'Salary' to the [0, 1] range
scaler = MinMaxScaler()
df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)
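A practical note: in a real train/test workflow, the scaler should be fit on the training data only and then reused on the test data, so that no test-set statistics leak into training. A minimal sketch, continuing with the same scaler and illustrative salary values:

train = pd.DataFrame({'Salary': [20000, 50000, 100000]})
test = pd.DataFrame({'Salary': [150000]})

scaler = MinMaxScaler()
train['Normalized_Salary'] = scaler.fit_transform(train[['Salary']])

# transform() reuses the min/max learned from the training data,
# so unseen values outside that range can fall outside [0, 1]
test['Normalized_Salary'] = scaler.transform(test[['Salary']])
print(test)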


  • Standardization (Z-score Scaling): Scales values to have mean 0 and standard deviation 1. Unlike min-max scaling, the results are not bounded to a fixed range, and extreme z-scores (e.g., |z| > 3) make outliers easy to spot.

    • When to use?
      • When data should have mean = 0 and standard deviation = 1
      • Needed for linear regression, logistic regression, and PCA
    • Formula:


X' = \frac{X - \mu}{\sigma}

Python Example:

from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

# Rescale 'Salary' to mean 0 and standard deviation 1
scaler = StandardScaler()
df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)
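As a quick sanity check, the standardized column should have mean ≈ 0 and standard deviation ≈ 1. Note that StandardScaler uses the population standard deviation (ddof=0):

# Verify: mean ~0 and population standard deviation ~1
print(df['Standardized_Salary'].mean())
print(df['Standardized_Salary'].std(ddof=0))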


  • Log Transformation: For skewed distributions.

    • When to use?
      • To reduce right skewness in the data. (Right skewness, also called positive skew, is a distribution where most values are concentrated on the left and the tail extends to the right.)
      • Helps stabilize variance.

Python Example:

import numpy as np

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})

# The natural log compresses the long right tail of 'Income'
df['Log_Income'] = np.log(df['Income'])
print(df)
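One caveat: np.log is undefined at zero. If the data can contain zeros, a common alternative is np.log1p, which computes log(1 + x). A minimal sketch with illustrative values:

# log1p handles zero values gracefully: log1p(0) = 0
df2 = pd.DataFrame({'Income': [0, 1000, 10000, 100000]})
df2['Log1p_Income'] = np.log1p(df2['Income'])
print(df2)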


  • Box-Cox Transformation: An advanced transformation for skewed distributions that fits a power parameter λ to make the data as close to normal as possible. Works only with strictly positive values.
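    • Formula (for positive X, with λ fitted from the data):

X' = \begin{cases} \dfrac{X^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln X, & \lambda = 0 \end{cases}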

Python Example:

from scipy.stats import boxcox

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})

# Apply Box-Cox transformation (the fitted lambda is discarded here)
df['BoxCox_Sales'], _ = boxcox(df['Sales'])
print(df)
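boxcox also returns the fitted λ, which is discarded as _ above. Keeping it allows the transformation to be inverted later with scipy.special.inv_boxcox. A minimal sketch:

from scipy.special import inv_boxcox

# Keep the fitted lambda instead of discarding it
df['BoxCox_Sales'], lam = boxcox(df['Sales'])

# inv_boxcox maps the transformed values back to the original scale
print(inv_boxcox(df['BoxCox_Sales'], lam))  # ~[1, 5, 10, 50, 100]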


  • Robust Scaling: Scales values using statistics that are robust to outliers (see the formula below).

    • When to use?
      • When data contains outliers.
      • Uses median and IQR instead of mean and standard deviation.
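    • Formula (as used by scikit-learn's RobustScaler with default settings, where IQR = Q3 − Q1):

X' = \frac{X - \text{median}(X)}{\text{IQR}(X)}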

Python Example:

from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

# Center on the median and scale by the IQR, so the outlier (100)
# has far less influence than it would on the mean and standard deviation
scaler = RobustScaler()
df['Robust_Score'] = scaler.fit_transform(df[['Score']])
print(df)
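For comparison, here is the same data scaled with StandardScaler: the outlier pulls the mean up to 26.5 and inflates the standard deviation, so the scores 1, 2, and 3 all land near −0.6, while the robust scores above stay centered near 0 because the median (2.5) ignores the outlier:

from sklearn.preprocessing import StandardScaler

# Same data, scaled with mean/std instead of median/IQR
df['Standard_Score'] = StandardScaler().fit_transform(df[['Score']])
print(df)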


Link to Data Preprocessing Home



Please visit, subscribe and share 10 Minutes Lectures in Computer Science