Monday, August 11, 2025

Encoding Categorical Variables

Encoding categorical variables during data preprocessing in machine learning and data science.



Encoding is mandatory when the dataset contains non-numeric features.

Why Do We Need to Encode Categorical Variables in Machine Learning?

Most machine learning algorithms can only process numerical input — they don’t understand text or categories like "Red", "Dog", or "India".

So we encode categorical variables to convert them into a numeric format that models can interpret and learn from.

 

Key Problems with Raw Categorical Data:

  • Categorical data is not numeric: algorithms require numeric operations.
  • No natural order in text: "Blue" > "Red" makes no sense numerically.
  • High cardinality: increases model complexity if not handled properly.
  • Risk of misleading the model: if text is converted to integers improperly.

 

When to Encode Categorical Variables?

  • Model cannot handle strings: any encoding; libraries such as scikit-learn and XGBoost expect numeric input.
  • Ordinal categories (Low, Medium, High): Label Encoding, because the order must be preserved.
  • Nominal categories (Color, City): One-Hot Encoding, because there is no order and each value should be independent.
  • High-cardinality features (Zip, ID, Product): Target or Frequency Encoding, which reduces dimensionality compared to one-hot.
  • Preparing for distance-based models (KNN): One-Hot or Binary Encoding, to avoid the false closeness implied by label encoding (see the sketch after this list).
  • Tree-based models (Random Forest, XGBoost): Label or Target Encoding; tree splits are threshold-based, so arbitrary integer codes do not mislead them.
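To make the KNN row concrete, here is a minimal sketch (my own illustration, not from the original post) of why label-encoded integers create false closeness, while one-hot vectors keep every pair of distinct categories equally far apart:

import numpy as np

# Label encoding: Red=0, Green=1, Blue=2 (arbitrary integer codes)
red, green, blue = 0, 1, 2
print(abs(red - green))  # 1 -> "Red" looks closer to "Green"...
print(abs(red - blue))   # 2 -> ...than to "Blue", an ordering the data never implied

# One-hot encoding: all distinct colors are equidistant
red_oh, green_oh, blue_oh = np.eye(3)
print(np.linalg.norm(red_oh - green_oh))  # ~1.414
print(np.linalg.norm(red_oh - blue_oh))   # ~1.414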

 

Example: Why You Must Encode

from sklearn.linear_model import LogisticRegression

import pandas as pd

df = pd.DataFrame({

    'Color': ['Red', 'Green', 'Blue'],

    'Label': [1, 0, 1]

})

 

# This will raise an error:

model = LogisticRegression()

model.fit(df[['Color']], df['Label'])  # Error: Cannot handle strings


Fix it with encoding:

df_encoded = pd.get_dummies(df[['Color']])  # one 0/1 column per color; the target stays out of the features

model.fit(df_encoded, df['Label'])  # Works fine

 

Why Encode, and When to Do It?

  • Models can't handle strings: encode before training any non-NLP model.
  • Maintain ordinal/numerical meaning: when a feature has ranked categories.
  • Reduce dimensionality: for features with many unique values.
  • Avoid misleading the model: always, when feeding categorical data.

 

Different Encoding Techniques

  • Label Encoding (Ordinal data) - Converts each category into an integer (e.g., Low = 0, Medium = 1, High = 2).

    • When to use?
      • When categories have natural orders (ordinal).
      • Examples: ["Low", "Medium", "High"], ["Poor", "Average", "Good"].

Python Example:

import pandas as pd

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

encoder = LabelEncoder()

df['Size_LabelEncoded'] = encoder.fit_transform(df['Size'])  # codes are assigned alphabetically: Large=0, Medium=1, Small=2

print(df)
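Because LabelEncoder assigns codes alphabetically, the result above does not follow the Small < Medium < Large ranking. A minimal sketch of one way to preserve the intended order, using scikit-learn's OrdinalEncoder with an explicit category list:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
# Pass the categories in rank order so that Small=0, Medium=1, Large=2
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_OrdinalEncoded'] = encoder.fit_transform(df[['Size']]).ravel()
print(df)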

 

  • One-Hot Encoding - Creates separate binary columns (0/1) for each category

    • When to use?
      • For nominal data (no inherent order).
      • Examples: ["Red", "Blue", "Green"], ["Dog", "Cat", "Bird"].

Python Example:

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})

# One binary (0/1) column per color
df_onehot = pd.get_dummies(df, columns=['Color'])

print(df_onehot)
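pd.get_dummies is convenient for quick exploration; inside a scikit-learn workflow, OneHotEncoder is often preferred because it remembers the categories seen during fit. A minimal sketch, assuming a reasonably recent scikit-learn version that provides get_feature_names_out:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
encoder = OneHotEncoder(handle_unknown='ignore')  # unseen categories become all-zero rows at transform time
onehot = encoder.fit_transform(df[['Color']]).toarray()  # dense 0/1 matrix, one column per color
print(encoder.get_feature_names_out())  # e.g. ['Color_Blue' 'Color_Green' 'Color_Red']
print(onehot)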

 

  • Target Encoding (Mean Encoding) - Replaces each category with the mean of the target variable for that category.

    • When to use?
      • For high-cardinality features (e.g., hundreds of categories like ZIP codes).
      • Must be used carefully to avoid data leakage.

Python Example:

df = pd.DataFrame({

    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Mumbai'],

    'Sales': [100, 200, 150, 130, 160, 220]

})

# Target encoding: average sales per city

mean_encoded = df.groupby('City')['Sales'].mean()

df['City_TargetEncoded'] = df['City'].map(mean_encoded)

print(df)
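To keep the leakage warning concrete, here is a minimal sketch (my own illustration with made-up numbers, not from the original post) that learns the category means from a training split only and applies them to new data, falling back to the global mean for unseen categories:

import pandas as pd

train = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'Sales': [100, 200, 150, 130]
})
test = pd.DataFrame({'City': ['Chennai', 'Kolkata']})  # 'Kolkata' never appears in training

city_means = train.groupby('City')['Sales'].mean()  # learned from training data only
global_mean = train['Sales'].mean()                 # fallback for unseen categories

train['City_TargetEncoded'] = train['City'].map(city_means)
test['City_TargetEncoded'] = test['City'].map(city_means).fillna(global_mean)
print(test)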


  • Frequency Encoding - Replaces each category with the frequency (count) or proportion.

    • When to use?
      • Also for high-cardinality categorical columns.
      • Safer than target encoding (no leakage risk).

Python Example:

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})

# Frequency encoding

freq = df['Product'].value_counts()

df['Product_FrequencyEncoded'] = df['Product'].map(freq)

print(df)
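The definition above also mentions proportions; the same example can map each product to its share of rows by passing normalize=True to value_counts:

import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})
prop = df['Product'].value_counts(normalize=True)  # A=0.50, B=0.33, C=0.17 (approx.)
df['Product_ProportionEncoded'] = df['Product'].map(prop)
print(df)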

 

Link to Data Preprocessing Home


 
