
Monday, August 11, 2025

Dimensionality Reduction

Dimensionality reduction techniques in machine learning and how they are used to simplify models.


Dimensionality Reduction

Creates new features by transforming the original ones into a lower-dimensional space.

Dimensionality reduction is the process of reducing the number of input variables (features) in a dataset while preserving as much important information as possible.

In data analytics and machine learning, datasets can have dozens, hundreds, or even thousands of features — but not all of them are equally important. Too many features can lead to:

  • High computational cost (slower training, more memory usage).
  • Overfitting (model learns noise instead of patterns).
  • Difficulty in visualization and interpretation (especially beyond 3D).

 

“Dimensionality reduction simplifies models, removes redundancy, reduces noise, and helps visualization”.

PCA / t-SNE / UMAP (especially for high-dimensional data)

  • Principal Component Analysis (PCA) - Transforms the original features into a new set of uncorrelated components that retain as much of the variance as possible. PCA does not use the target variable (i.e., y or labels) when reducing dimensionality; it only considers the features (X), looking for the directions (principal components) in the feature space that capture the most variance.

    • When to use?
      • High-dimensional data.
      • When you want to reduce dimensionality without losing much information.

Python Example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is your feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)  # Scaling first is an important step for PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)


  • t-SNE (t-Distributed Stochastic Neighbor Embedding) - Visualizes complex high-dimensional data in 2D or 3D by preserving local structure (nearby points stay nearby).

    • When to use?
      • Visual exploration of high-dimensional data in 2D or 3D.
      • When preserving local neighborhood structure matters more than exact global distances.

Python Example:

from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# X is your feature matrix, y the labels (used only to color the plot)
X_scaled = StandardScaler().fit_transform(X)  # Scale features first
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
plt.title("t-SNE visualization")
plt.show()


  • UMAP (Uniform Manifold Approximation and Projection) - Dimensionality reduction similar to t-SNE, but faster and better at preserving global structure. Great for clustering or visualization.

    • When to use?
      • Visualization or clustering of high-dimensional data.
      • Works better than t-SNE for larger datasets.

Python Example:

# Requires the umap-learn package (pip install umap-learn)
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)  # X is your feature matrix

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y)  # y: labels, used only for coloring
plt.title("UMAP visualization")
plt.show()

 


Link to Data Preprocessing Home


Encoding Categorical Variables

Encoding categorical variables during data preprocessing in machine learning and data science.


Encoding Categorical Variables

Mandatory when the dataset contains non-numeric features.

Why Do We Need to Encode Categorical Variables in Machine Learning?

Most machine learning algorithms can only process numerical input — they don’t understand text or categories like "Red", "Dog", or "India".

So we encode categorical variables to convert them into numeric format that models can interpret and learn from.

 

Key Problems with Raw Categorical Data:

Problem | Why it's a Problem
Categorical data is not numeric | Algorithms require numeric operations
No natural order in text | "Blue" > "Red" makes no sense numerically
High cardinality | Increases model complexity if not handled properly
Risk of misleading the model | If text is converted to integers improperly

 

When to Encode Categorical Variables?

Scenario | Encoding Technique | Why
Model cannot handle strings | Any encoding | Models like scikit-learn, XGBoost, etc. require numeric input
Ordinal categories (Low, Medium, High) | Label Encoding | There is an order that must be preserved
Nominal categories (Color, City) | One-Hot Encoding | There is no order; each value should be independent
High-cardinality features (Zip, ID, Product) | Target/Frequency Encoding | Reduces dimensionality compared to one-hot
Preparing for distance-based models (KNN) | One-Hot or Binary Encoding | Avoids false closeness from label encoding
Tree-based models (Random Forest, XGBoost) | Label or Target Encoding | Trees are not sensitive to monotonic encodings

 

Example: Why You Must Encode

from sklearn.linear_model import LogisticRegression

import pandas as pd

df = pd.DataFrame({

    'Color': ['Red', 'Green', 'Blue'],

    'Label': [1, 0, 1]

})

 

# This will raise an error:

model = LogisticRegression()

model.fit(df[['Color']], df['Label'])  # Error: Cannot handle strings


Fix it with encoding:

df_encoded = pd.get_dummies(df, columns=['Color'])
X_encoded = df_encoded.drop(columns=['Label'])  # keep only the encoded feature columns
model.fit(X_encoded, df['Label'])  # Works fine

 

Why Encode? | When to Do It?
Models can't handle strings | Before training any non-NLP model
Maintain ordinal/numerical meaning | When feature has ranked categories
Reduce dimensionality | For features with many unique values
Avoid misleading the model | Always, when feeding categorical data

 

Different Encoding Techniques

  • Label Encoding (Ordinal data) - Converts each category into an integer (e.g., Low = 0, Medium = 1, High = 2).

    • When to use?
      • When categories have natural orders (ordinal).
      • Examples: ["Low", "Medium", "High"], ["Poor", "Average", "Good"].

Python Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
encoder = LabelEncoder()
# Note: LabelEncoder assigns integers in alphabetical order (Large=0, Medium=1, Small=2),
# so the intended Small < Medium < Large order is not preserved automatically.
df['Size_LabelEncoded'] = encoder.fit_transform(df['Size'])
print(df)
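
When the category order itself matters, scikit-learn's OrdinalEncoder lets you state that order explicitly. A minimal sketch, reusing the df from the example above:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small < Medium < Large maps to 0 < 1 < 2
ordinal = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_OrdinalEncoded'] = ordinal.fit_transform(df[['Size']]).ravel()
print(df)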

 

  • One-Hot Encoding - Creates separate binary columns (0/1) for each category

    • When to use?
      • For nominal data (no inherent order).
      • Examples: ["Red", "Blue", "Green"], ["Dog", "Cat", "Bird"].

Python Example:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])  # one binary column per category
print(df_encoded)

 

  • Target Encoding (Mean Encoding) - Replaces each category with the mean of the target variable for that category.

    • When to use?
      • For high-cardinality features (e.g., hundreds of categories like ZIP codes).
      • Must be used carefully to avoid data leakage (a leakage-aware, leave-one-out variant is sketched after the example below).

Python Example:

df = pd.DataFrame({

    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Mumbai'],

    'Sales': [100, 200, 150, 130, 160, 220]

})

# Target encoding: average sales per city

mean_encoded = df.groupby('City')['Sales'].mean()

df['City_TargetEncoded'] = df['City'].map(mean_encoded)

print(df)
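
The simple mapping above computes each category's mean on the full dataset, which leaks target information if the same rows are then used for training. One common safeguard is to compute each row's encoding from the other rows of its category (leave-one-out); a minimal sketch, reusing the df above and assuming every city appears more than once:

# Leave-one-out target encoding: exclude the current row from its category mean
sums = df.groupby('City')['Sales'].transform('sum')
counts = df.groupby('City')['Sales'].transform('count')
df['City_LOO_Encoded'] = (sums - df['Sales']) / (counts - 1)
print(df)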


  • Frequency Encoding - Replaces each category with its frequency (count) or proportion in the data (a proportion variant is sketched after the example below).

    • When to use?
      • Also for high-cardinality categorical columns.
      • Safer than target encoding (it does not use the target, so there is no target-leakage risk).

Python Example:

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})

# Frequency encoding

freq = df['Product'].value_counts()

df['Product_FrequencyEncoded'] = df['Product'].map(freq)

print(df)
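
For the proportion variant mentioned above, value_counts can normalize the counts. A minimal sketch, reusing the df above:

# Proportion of rows per category instead of the raw count
freq_prop = df['Product'].value_counts(normalize=True)
df['Product_FrequencyProportion'] = df['Product'].map(freq_prop)
print(df)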

 

Link to Data Preprocessing Home


 

Thursday, August 7, 2025

Data Transformation and Feature Scaling in Data Preprocessing

Data transformation in data preprocessing: what is feature scaling and why is it important?


Data Transformation / Feature Scaling

Mandatory for algorithms that are sensitive to feature scale.

Data Transformation – Data transformation refers to changing the format, structure, or distribution of your data to make it more suitable for analysis or modeling.

Feature Scaling – Feature scaling is a specific kind of data transformation that focuses on rescaling numeric features so they have similar ranges or distributions.

The following are a few of the important data transformation / feature scaling methods:

  • Normalization (Min-Max Scaling): Scales values between 0 and 1. It is sensitive to outliers (a single extreme value compresses the rest of the range).

    • When to use?
      • When features are on a different scale
      • Required for algorithms like KNN, SVM, Neural Networks

    • Formula:

      X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}

Python Example:

 

from sklearn.preprocessing import MinMaxScaler

import pandas as pd

 

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

 

scaler = MinMaxScaler()

df['Normalized_Salary'] = scaler.fit_transform(df[['Salary']])

print(df)


  • Standardization (Z-score Scaling): Scales values to have mean 0 and standard deviation 1. Unlike min-max scaling it does not bound values to a fixed range, and it is generally less distorted by outliers.

    • When to use?
      • When data should have mean = 0 and standard deviation = 1
      • Needed for linear regression, logistic regression, PCA
    • Formula:


      X' = \frac{X - \mu}{\sigma}

Python Example:

from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Salary': [20000, 50000, 100000, 150000]})

scaler = StandardScaler()

df['Standardized_Salary'] = scaler.fit_transform(df[['Salary']])

print(df)

 

  • Log Transformation: For skewed distributions.

    • When to use?
      • To reduce right skewness in the data (right skewness, also called positive skew, means most values are concentrated on the left while the tail extends to the right).
      • Helps stabilize variance.

Python Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Income': [1000, 10000, 100000, 1000000]})
# np.log requires strictly positive values; use np.log1p when zeros may be present
df['Log_Income'] = np.log(df['Income'])
print(df)

 

  • Box-Cox Transformation: Advanced transformation for skewed distributions. Works only with positive values.

Python Example:

from scipy.stats import boxcox

df = pd.DataFrame({'Sales': [1, 5, 10, 50, 100]})

 

# Apply Box-Cox transformation (boxcox returns the transformed values and the fitted lambda)
df['BoxCox_Sales'], _ = boxcox(df['Sales'])

 

print(df)

 

  • Robust Scaling: Scales features using statistics that are robust to outliers.

    • When to use?
      • When data contains outliers.
      • Uses median and IQR instead of mean and standard deviation.

Python Example:

from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({'Score': [1, 2, 3, 100]})

scaler = RobustScaler()

df['Robust_Score'] = scaler.fit_transform(df[['Score']])

 

print(df)

 

 

Link to Data Preprocessing Home

