Monday, August 11, 2025

Dimensionality Reduction

An overview of dimensionality reduction techniques in machine learning and how they are used to simplify models.


Dimensionality reduction is the process of reducing the number of input variables (features) in a dataset while preserving as much important information as possible. It does this by creating new features that transform the original ones into a lower-dimensional space.

In data analytics and machine learning, datasets can have dozens, hundreds, or even thousands of features — but not all of them are equally important. Too many features can lead to:

  • High computational cost (slower training, more memory usage).
  • Overfitting (model learns noise instead of patterns).
  • Difficulty in visualization and interpretation (especially beyond 3D).


“Dimensionality reduction simplifies models, removes redundancy, reduces noise, and helps visualization.”

PCA / t-SNE / UMAP (especially for high-dimensional data)

·  Principal Component Analysis (PCA) - Transforms the original features into new, uncorrelated components ordered by how much variance they explain, so a few components can retain most of the information. PCA is unsupervised: it does not use the target variable (y or labels) when reducing dimensionality, only the features (X). In other words, PCA looks for directions (principal components) in the feature space that capture the most variance.

o    When to use?

§  High-dimensional data.

§  When you want to reduce dimensionality without losing much information.

Python Example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example data (iris) stands in for your own feature matrix X and labels y
X, y = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to feature scales
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

# Plot the data projected onto the first two components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title("PCA projection")
plt.show()


·    t-SNE (t-Distributed Stochastic Neighbor Embedding) - Visualizes complex high-dimensional data in 2D or 3D by preserving local neighborhood structure; mainly a visualization tool rather than a general preprocessing step.

o    When to use?

§  Visualizing high-dimensional data in 2D or 3D.

§  When local neighborhood structure (which points sit close together) matters more than global distances.

Python Example:

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example data (iris) stands in for your own feature matrix X and labels y
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# perplexity (roughly, the effective neighborhood size) must be < n_samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
plt.title("t-SNE visualization")
plt.show()


·    UMAP (Uniform Manifold Approximation and Projection) - Dimensionality reduction similar to t-SNE, but typically faster and better at preserving global structure. Great for clustering or visualization.

o    When to use?

§  Visualization or clustering of high-dimensional data.

§  Typically scales better than t-SNE on larger datasets.

Python Example:

import umap  # pip install umap-learn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Example data (iris) stands in for your own feature matrix X and labels y
X, y = load_iris(return_X_y=True)

reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y)
plt.title("UMAP visualization")
plt.show()
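
Unlike scikit-learn's t-SNE, a fitted UMAP reducer exposes transform(), so new samples can be projected into an existing embedding. A minimal sketch, where X_train and X_new are hypothetical arrays with identical feature columns:

# Fit on training data, then project unseen rows into the same 2D space
reducer = umap.UMAP(n_components=2, random_state=42).fit(X_train)  # X_train: hypothetical
X_new_2d = reducer.transform(X_new)  # X_new: hypothetical unseen samples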



Link to Data Preprocessing Home

