Data transformation in data preprocessing: what is feature scaling and why is it important?
Data Transformation / Feature Scaling
Mandatory for algorithms that are sensitive to feature scale.
Data Transformation – Data transformation refers to changing
the format, structure, or distribution of your data to make it more
suitable for analysis or modeling.
Feature Scaling – Feature scaling is a specific
kind of data transformation that focuses on rescaling numeric features
so they have similar ranges or distributions.
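To see why scale matters for distance-based algorithms, here is a minimal sketch using made-up age and salary values. The data and the helper `dist` are illustrative assumptions, not part of the original post. Before scaling, salary dominates the Euclidean distance; after min-max scaling, both features contribute on a comparable footing and the nearest neighbour of person A changes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: each row is [age in years, salary]
X = np.array([
    [25, 50000.0],   # A
    [60, 51000.0],   # B: very different age, similar salary
    [27, 54000.0],   # C: similar age, somewhat different salary
    [40, 90000.0],   # D: widens the salary range
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On raw values the salary column dominates, so B looks closer to A than C does
raw_ab = dist(X[0], X[1])
raw_ac = dist(X[0], X[2])

# After rescaling each column to [0, 1], age differences count as well
X_scaled = MinMaxScaler().fit_transform(X)
scaled_ab = dist(X_scaled[0], X_scaled[1])
scaled_ac = dist(X_scaled[0], X_scaled[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.3f}, A-C = {scaled_ac:.3f}")
```

A KNN model trained on the raw columns would effectively ignore age, which is the kind of behaviour feature scaling is meant to prevent.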
The following are some of the important data transformation / feature scaling methods:
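- Normalization (Min-Max Scaling): Scales values to the range 0 to 1.
It is sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses all other values into a narrow band.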
  - When to use?
    - When features are on different scales
    - Required for scale-sensitive algorithms like KNN, SVM, and neural networks
  - Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
Python Example:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})
scaler = MinMaxScaler()
df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)
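As a rough check of what min-max scaling does, the sketch below recomputes it by hand with NumPy and compares the result with `MinMaxScaler`. It then appends one extreme salary to show how an outlier squeezes the remaining values. The outlier value of 5,000,000 is an invented figure for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

salary = np.array([20000, 50000, 100000, 150000], dtype=float)

# Manual min-max scaling: (x - min) / (max - min)
manual = (salary - salary.min()) / (salary.max() - salary.min())

# The same computation through scikit-learn (expects a 2-D array)
sk = MinMaxScaler().fit_transform(salary.reshape(-1, 1)).ravel()
print(np.allclose(manual, sk))

# Hypothetical outlier: one very large salary is appended
with_outlier = np.append(salary, 5_000_000.0)
squeezed = (with_outlier - with_outlier.min()) / (
    with_outlier.max() - with_outlier.min()
)
# The original four values are now crowded near 0; the outlier maps to 1
print(squeezed.round(4))
```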
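- Standardization (Z-score Scaling): Scales values to have mean 0
and standard deviation 1. It does not bound values to a fixed range, and the resulting z-scores can help flag outliers (for example, |z| > 3), although the mean and standard deviation are themselves affected by extreme values.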
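  - When to use?
    - When data should have mean = 0 and standard deviation = 1
    - Useful for regularized or gradient-based linear regression and logistic regression, and for PCA
  - Formula: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the feature's mean and $\sigma$ is its standard deviation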
Python Example:
from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})
scaler = StandardScaler()
df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])
print(df)
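The z-score can also be computed by hand, which makes one detail visible: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, using the same salary values, shows that the manual NumPy version matches scikit-learn and the default pandas version does not.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})
x = df['Salary'].to_numpy(dtype=float)

# Manual z-score; NumPy's std() uses ddof=0 (population) by default
manual = (x - x.mean()) / x.std()

# scikit-learn's result for comparison
sk = StandardScaler().fit_transform(df[['Salary']]).ravel()

# pandas' std() uses ddof=1 (sample) by default, so values differ slightly
pandas_z = ((df['Salary'] - df['Salary'].mean()) / df['Salary'].std()).to_numpy()

print(np.allclose(manual, sk))    # manual NumPy version agrees with scikit-learn
print(np.allclose(pandas_z, sk))  # default pandas version does not
```

On large datasets the difference between the two conventions is negligible, but it can explain small mismatches when comparing results on tiny samples like this one.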
- Log Transformation: Compresses large values in skewed distributions. Requires strictly positive values.
  - When to use?
    - To reduce right skewness in the data. Right skewness (also called positive skew) refers to a distribution where most values are concentrated on the left, but the tail extends to the right.
    - To help stabilize variance.
Python Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})
df['Log_Income'] = np.log(df['Income'])
print(df)
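One way to confirm that the log transformation reduces right skew is to measure skewness before and after, here with `scipy.stats.skew`. The sketch also shows `np.log1p`, a common workaround when the data contains zeros, since `np.log(0)` is negative infinity. The `counts` array is a made-up example, not data from the original post.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})

# Skewness before and after the log transformation
skew_before = skew(df['Income'].to_numpy(dtype=float))
skew_after = skew(np.log(df['Income'].to_numpy(dtype=float)))
print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")

# Hypothetical data containing a zero: log1p computes log(1 + x),
# and expm1 reverses it
counts = np.array([0.0, 9.0, 99.0])
log1p_counts = np.log1p(counts)
restored_counts = np.expm1(log1p_counts)
print(log1p_counts, restored_counts)
```

Because these incomes are evenly spaced powers of ten, their logarithms are evenly spaced and the skewness drops to essentially zero; real data will rarely be this tidy.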
- Box-Cox Transformation: A power transformation for skewed distributions whose parameter (lambda) is estimated from the data. Works only with strictly positive values.
Python Example:
from scipy.stats import boxcox
import pandas as pd

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})
# Apply Box-Cox transformation; the second return value is the fitted lambda
df['BoxCox_Sales'], _ = boxcox(df['Sales'])
print(df)
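The fitted lambda is worth keeping, because it is needed to map transformed values (or model predictions) back to the original scale. The following sketch keeps it, inverts the transformation with `scipy.special.inv_boxcox`, shows that fixing `lmbda=0` reduces Box-Cox to the natural log, and checks that non-positive input is rejected. The variable names are illustrative choices.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

sales = np.array([1, 5, 10, 50, 100], dtype=float)

# With no lmbda argument, boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = boxcox(sales)
print(f"fitted lambda: {fitted_lambda:.4f}")

# Invert the transformation to recover the original values
restored = inv_boxcox(transformed, fitted_lambda)

# With a fixed lambda of 0, Box-Cox is the natural log
log_like = boxcox(sales, lmbda=0)

# Non-positive data is rejected with a ValueError
try:
    boxcox(np.array([0.0, 1.0, 2.0]))
    rejected_non_positive = False
except ValueError:
    rejected_non_positive = True
print(rejected_non_positive)
```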
- Robust Scaling: Uses the median and IQR (interquartile range) instead of the mean and standard deviation.
  - When to use?
    - When data contains outliers.
Python Example:
from sklearn.preprocessing import RobustScaler
import pandas as pd

df = pd.DataFrame({'Score': [1, 2, 3, 100]})
scaler = RobustScaler()
df['Robust_Score'] = scaler.fit_transform(df[['Score']])
print(df)
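For comparison, robust scaling can be reproduced by hand as (x - median) / IQR. The sketch below assumes `RobustScaler`'s default settings (centering on the median and scaling by the 25th to 75th percentile range) and contrasts the result with `StandardScaler` on the same scores.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

score = np.array([1, 2, 3, 100], dtype=float)

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(score)                 # 2.5 for this data
q1, q3 = np.percentile(score, [25, 75])
iqr = q3 - q1
manual = (score - median) / iqr

# scikit-learn's RobustScaler with default settings
robust = RobustScaler().fit_transform(score.reshape(-1, 1)).ravel()

# StandardScaler for contrast: its mean and std are pulled by the outlier
standard = StandardScaler().fit_transform(score.reshape(-1, 1)).ravel()

print(np.allclose(manual, robust))
print("robust:  ", robust.round(3))
print("standard:", standard.round(3))
```

The outlier is still present after robust scaling; the method only keeps it from determining the center and spread used for the other values.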