
Monday, August 11, 2025

Dimensionality Reduction

Dimensionality reduction techniques in machine learning and how they are used to simplify models.


Dimensionality Reduction

Creates new features by transforming the original ones into a lower-dimensional space.

Dimensionality reduction is the process of reducing the number of input variables (features) in a dataset while preserving as much important information as possible.

In data analytics and machine learning, datasets can have dozens, hundreds, or even thousands of features — but not all of them are equally important. Too many features can lead to:

  • High computational cost (slower training, more memory usage).
  • Overfitting (model learns noise instead of patterns).
  • Difficulty in visualization and interpretation (especially beyond 3D).

 

“Dimensionality reduction simplifies models, removes redundancy, reduces noise, and helps visualization”.

PCA / t-SNE / UMAP (especially for high-dimensional data)

  • Principal Component Analysis (PCA) - Transforms the original features into a new set of uncorrelated components that retain as much of the variance as possible. PCA does not use the target variable (i.e., y or labels) when reducing dimensionality; it only considers the features (X), looking for the directions (principal components) in the feature space that capture the most variance.

    • When to use?
      • High-dimensional data.
      • When you want to reduce dimensionality without losing much information.

Python Example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is your feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)  # Scaling first is an important step for PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)


  • t-SNE (t-Distributed Stochastic Neighbor Embedding) - Visualizes complex high-dimensional data in 2D or 3D by preserving local structure (nearby points stay nearby).

    • When to use?
      • Visual exploration of high-dimensional data in 2D or 3D.
      • When preserving local neighborhood structure matters more than exact global distances.

Python Example:

from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# X is your feature matrix, y the labels (used only to color the plot)
X_scaled = StandardScaler().fit_transform(X)  # Scale features first
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
plt.title("t-SNE visualization")
plt.show()


  • UMAP (Uniform Manifold Approximation and Projection) - Dimensionality reduction similar to t-SNE, but faster and better at preserving global structure. Great for clustering or visualization.

    • When to use?
      • Visualization or clustering of high-dimensional data.
      • Works better than t-SNE for larger datasets.

Python Example:

# Requires the umap-learn package (pip install umap-learn)
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)  # X is your feature matrix

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y)  # y: labels, used only for coloring
plt.title("UMAP visualization")
plt.show()

 


Link to Data Preprocessing Home


Encoding Categorical Variables

Encoding categorical variables during data preprocessing in machine learning and data science.


Encoding Categorical Variables

Mandatory when the dataset contains non-numeric features.

Why Do We Need to Encode Categorical Variables in Machine Learning?

Most machine learning algorithms can only process numerical input — they don’t understand text or categories like "Red", "Dog", or "India".

So we encode categorical variables to convert them into numeric format that models can interpret and learn from.

 

Key Problems with Raw Categorical Data:

Problem | Why it's a Problem
Categorical data is not numeric | Algorithms require numeric operations
No natural order in text | "Blue" > "Red" makes no sense numerically
High cardinality | Increases model complexity if not handled properly
Risk of misleading the model | If text is converted to integers improperly

 

When to Encode Categorical Variables?

Scenario | Encoding Technique | Why
Model cannot handle strings | Any encoding | Models like scikit-learn, XGBoost, etc. require numeric input
Ordinal categories (Low, Medium, High) | Label Encoding | There is an order that must be preserved
Nominal categories (Color, City) | One-Hot Encoding | There is no order; each value should be independent
High-cardinality features (Zip, ID, Product) | Target/Frequency Encoding | Reduces dimensionality compared to one-hot
Preparing for distance-based models (KNN) | One-Hot or Binary Encoding | Avoids false closeness from label encoding
Tree-based models (Random Forest, XGBoost) | Label or Target Encoding | Trees are not sensitive to monotonic encodings

 

Example: Why You Must Encode

from sklearn.linear_model import LogisticRegression

import pandas as pd

df = pd.DataFrame({

    'Color': ['Red', 'Green', 'Blue'],

    'Label': [1, 0, 1]

})

 

# This will raise an error:

model = LogisticRegression()

model.fit(df[['Color']], df['Label'])  # Error: Cannot handle strings


Fix it with encoding:

df_encoded = pd.get_dummies(df, columns=['Color'])
X_encoded = df_encoded.drop(columns=['Label'])  # keep only the encoded feature columns
model.fit(X_encoded, df['Label'])  # Works fine

 

Why Encode? | When to Do It?
Models can't handle strings | Before training any non-NLP model
Maintain ordinal/numerical meaning | When feature has ranked categories
Reduce dimensionality | For features with many unique values
Avoid misleading the model | Always, when feeding categorical data

 

Different Encoding Techniques

  • Label Encoding (Ordinal data) - Converts each category into an integer (e.g., Low = 0, Medium = 1, High = 2).

    • When to use?
      • When categories have natural orders (ordinal).
      • Examples: ["Low", "Medium", "High"], ["Poor", "Average", "Good"].

Python Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
encoder = LabelEncoder()
# Note: LabelEncoder assigns integers in alphabetical order (Large=0, Medium=1, Small=2),
# so the intended Small < Medium < Large order is not preserved automatically.
df['Size_LabelEncoded'] = encoder.fit_transform(df['Size'])
print(df)
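
When the category order itself matters, scikit-learn's OrdinalEncoder lets you state that order explicitly. A minimal sketch, reusing the df from the example above:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small < Medium < Large maps to 0 < 1 < 2
ordinal = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_OrdinalEncoded'] = ordinal.fit_transform(df[['Size']]).ravel()
print(df)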

 

  • One-Hot Encoding - Creates separate binary columns (0/1) for each category

    • When to use?
      • For nominal data (no inherent order).
      • Examples: ["Red", "Blue", "Green"], ["Dog", "Cat", "Bird"].

Python Example:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])  # one binary column per category
print(df_encoded)

 

  • Target Encoding (Mean Encoding) - Replaces each category with the mean of the target variable for that category.

    • When to use?
      • For high-cardinality features (e.g., hundreds of categories like ZIP codes).
      • Must be used carefully to avoid data leakage (a leakage-aware, leave-one-out variant is sketched after the example below).

Python Example:

df = pd.DataFrame({

    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Mumbai'],

    'Sales': [100, 200, 150, 130, 160, 220]

})

# Target encoding: average sales per city

mean_encoded = df.groupby('City')['Sales'].mean()

df['City_TargetEncoded'] = df['City'].map(mean_encoded)

print(df)
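
The simple mapping above computes each category's mean on the full dataset, which leaks target information if the same rows are then used for training. One common safeguard is to compute each row's encoding from the other rows of its category (leave-one-out); a minimal sketch, reusing the df above and assuming every city appears more than once:

# Leave-one-out target encoding: exclude the current row from its category mean
sums = df.groupby('City')['Sales'].transform('sum')
counts = df.groupby('City')['Sales'].transform('count')
df['City_LOO_Encoded'] = (sums - df['Sales']) / (counts - 1)
print(df)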


  • Frequency Encoding - Replaces each category with its frequency (count) or proportion in the data (a proportion variant is sketched after the example below).

    • When to use?
      • Also for high-cardinality categorical columns.
      • Safer than target encoding (it does not use the target, so there is no target-leakage risk).

Python Example:

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})

# Frequency encoding

freq = df['Product'].value_counts()

df['Product_FrequencyEncoded'] = df['Product'].map(freq)

print(df)
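
For the proportion variant mentioned above, value_counts can normalize the counts. A minimal sketch, reusing the df above:

# Proportion of rows per category instead of the raw count
freq_prop = df['Product'].value_counts(normalize=True)
df['Product_FrequencyProportion'] = df['Product'].map(freq_prop)
print(df)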

 

Link to Data Preprocessing Home


 

Thursday, August 7, 2025

Data Transformation and Feature Scaling in Data Preprocessing

Data transformation in data preprocessing: what is feature scaling and why is it important?


Data Transformation / Feature Scaling

Mandatory for algorithms that are sensitive to feature scale.

Data Transformation – Data transformation refers to changing the format, structure, or distribution of your data to make it more suitable for analysis or modeling.

Feature Scaling – Feature scaling is a specific kind of data transformation that focuses on rescaling numeric features so they have similar ranges or distributions.

The following are a few of the important data transformation / feature scaling methods:

  • Normalization (Min-Max Scaling): Scales values between 0 and 1. It is sensitive to outliers (a single extreme value compresses the rest of the range).

    • When to use?
      • When features are on a different scale
      • Required for algorithms like KNN, SVM, Neural Networks

    • Formula:

      X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}

Python Example:

 

from sklearn.preprocessing import MinMaxScaler

import pandas as pd

 

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

 

scaler = MinMaxScaler()

df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])

print(df)


  • Standardization (Z-score Scaling): Scales values to have mean 0 and standard deviation 1. Unlike min-max scaling it does not bound values to a fixed range, and it is generally less distorted by outliers.

    • When to use?
      • When data should have mean = 0 and standard deviation = 1
      • Needed for linear regression, logistic regression, PCA
    • Formula:


      X' = \frac{X - \mu}{\sigma}

Python Example:

from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = StandardScaler()

df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])

print(df)

 

  • Log Transformation: For skewed distributions.

    • When to use?
      • To reduce right skewness in the data (right skewness, also called positive skew, means most values are concentrated on the left while the tail extends to the right).
      • Helps stabilize variance.

Python Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})
# np.log requires strictly positive values; use np.log1p when zeros may be present
df['Log_Income'] = np.log(df['Income'])
print(df)

 

  • Box-Cox Transformation: Advanced transformation for skewed distributions. Works only with positive values.

Python Example:

from scipy.stats import boxcox

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})

 

# Apply Box-Cox transformation (boxcox returns the transformed values and the fitted lambda)
df['BoxCox_Sales'], _ = boxcox(df['Sales'])

 

print(df)

 

  • Robust Scaling: Scales features using statistics that are robust to outliers.

    • When to use?
      • When data contains outliers.
      • Uses median and IQR instead of mean and standard deviation.

Python Example:

from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

scaler = RobustScaler()

df['Robust_Score'] = scaler.fit_transform(df[['Score']])

 

print(df)

 

 

Link to Data Preprocessing Home

