Monday, August 11, 2025

Encoding Categorical Variables

Encoding categorical variables during data preprocessing in machine learning and data science.



Encoding is mandatory when the dataset contains non-numeric features.

Why Do We Need to Encode Categorical Variables in Machine Learning?

Most machine learning algorithms can only process numerical input — they don’t understand text or categories like "Red", "Dog", or "India".

So we encode categorical variables to convert them into a numeric format that models can interpret and learn from.

 

Key Problems with Raw Categorical Data:

  • Categorical data is not numeric: algorithms require numeric operations.
  • No natural order in text: "Blue" > "Red" makes no sense numerically.
  • High cardinality: increases model complexity if not handled properly.
  • Risk of misleading the model: if text is converted to integers improperly.

 

When to Encode Categorical Variables?

  • Model cannot handle strings: any encoding; libraries such as scikit-learn and XGBoost expect numeric input.
  • Ordinal categories (Low, Medium, High): Label Encoding, because the order must be preserved.
  • Nominal categories (Color, City): One-Hot Encoding, because there is no order and each value should be independent.
  • High-cardinality features (Zip, ID, Product): Target or Frequency Encoding, which reduces dimensionality compared to one-hot.
  • Preparing for distance-based models (KNN): One-Hot or Binary Encoding, to avoid the false closeness implied by label encoding (see the sketch after this list).
  • Tree-based models (Random Forest, XGBoost): Label or Target Encoding; tree splits are threshold-based, so arbitrary integer codes do not mislead them.
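To make the KNN row concrete, here is a minimal sketch (my own illustration, not from the original post) of why label-encoded integers create false closeness, while one-hot vectors keep every pair of distinct categories equally far apart:

import numpy as np

# Label encoding: Red=0, Green=1, Blue=2 (arbitrary integer codes)
red, green, blue = 0, 1, 2
print(abs(red - green))  # 1 -> "Red" looks closer to "Green"...
print(abs(red - blue))   # 2 -> ...than to "Blue", an ordering the data never implied

# One-hot encoding: all distinct colors are equidistant
red_oh, green_oh, blue_oh = np.eye(3)
print(np.linalg.norm(red_oh - green_oh))  # ~1.414
print(np.linalg.norm(red_oh - blue_oh))   # ~1.414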

 

Example: Why You Must Encode

from sklearn.linear_model import LogisticRegression

import pandas as pd

df = pd.DataFrame({

    'Color': ['Red', 'Green', 'Blue'],

    'Label': [1, 0, 1]

})

 

# This will raise an error:

model = LogisticRegression()

model.fit(df[['Color']], df['Label'])  # Error: Cannot handle strings


Fix it with encoding:

df_encoded = pd.get_dummies(df[['Color']])  # one 0/1 column per color; the target stays out of the features

model.fit(df_encoded, df['Label'])  # Works fine

 

Why Encode, and When to Do It?

  • Models can't handle strings: encode before training any non-NLP model.
  • Maintain ordinal/numerical meaning: when a feature has ranked categories.
  • Reduce dimensionality: for features with many unique values.
  • Avoid misleading the model: always, when feeding categorical data.

 

Different Encoding Techniques

  • Label Encoding (Ordinal data) - Converts each category into an integer (e.g., Low = 0, Medium = 1, High = 2).

    • When to use?
      • When categories have natural orders (ordinal).
      • Examples: ["Low", "Medium", "High"], ["Poor", "Average", "Good"].

Python Example:

import pandas as pd

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

encoder = LabelEncoder()

df['Size_LabelEncoded'] = encoder.fit_transform(df['Size'])  # codes are assigned alphabetically: Large=0, Medium=1, Small=2

print(df)
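Because LabelEncoder assigns codes alphabetically, the result above does not follow the Small < Medium < Large ranking. A minimal sketch of one way to preserve the intended order, using scikit-learn's OrdinalEncoder with an explicit category list:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
# Pass the categories in rank order so that Small=0, Medium=1, Large=2
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_OrdinalEncoded'] = encoder.fit_transform(df[['Size']]).ravel()
print(df)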

 

  • One-Hot Encoding - Creates separate binary columns (0/1) for each category

    • When to use?
      • For nominal data (no inherent order).
      • Examples: ["Red", "Blue", "Green"], ["Dog", "Cat", "Bird"].

Python Example:

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})

# One binary (0/1) column per color
df_onehot = pd.get_dummies(df, columns=['Color'])

print(df_onehot)
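pd.get_dummies is convenient for quick exploration; inside a scikit-learn workflow, OneHotEncoder is often preferred because it remembers the categories seen during fit. A minimal sketch, assuming a reasonably recent scikit-learn version that provides get_feature_names_out:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
encoder = OneHotEncoder(handle_unknown='ignore')  # unseen categories become all-zero rows at transform time
onehot = encoder.fit_transform(df[['Color']]).toarray()  # dense 0/1 matrix, one column per color
print(encoder.get_feature_names_out())  # e.g. ['Color_Blue' 'Color_Green' 'Color_Red']
print(onehot)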

 

  • Target Encoding (Mean Encoding) - Replaces each category with the mean of the target variable for that category.

    • When to use?
      • For high-cardinality features (e.g., hundreds of categories like ZIP codes).
      • Must be used carefully to avoid data leakage.

Python Example:

df = pd.DataFrame({

    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Mumbai'],

    'Sales': [100, 200, 150, 130, 160, 220]

})

# Target encoding: average sales per city

mean_encoded = df.groupby('City')['Sales'].mean()

df['City_TargetEncoded'] = df['City'].map(mean_encoded)

print(df)
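To keep the leakage warning concrete, here is a minimal sketch (my own illustration with made-up numbers, not from the original post) that learns the category means from a training split only and applies them to new data, falling back to the global mean for unseen categories:

import pandas as pd

train = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'Sales': [100, 200, 150, 130]
})
test = pd.DataFrame({'City': ['Chennai', 'Kolkata']})  # 'Kolkata' never appears in training

city_means = train.groupby('City')['Sales'].mean()  # learned from training data only
global_mean = train['Sales'].mean()                 # fallback for unseen categories

train['City_TargetEncoded'] = train['City'].map(city_means)
test['City_TargetEncoded'] = test['City'].map(city_means).fillna(global_mean)
print(test)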


  • Frequency Encoding - Replaces each category with the frequency (count) or proportion.

    • When to use?
      • Also for high-cardinality categorical columns.
      • Safer than target encoding (no leakage risk).

Python Example:

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})

# Frequency encoding

freq = df['Product'].value_counts()

df['Product_FrequencyEncoded'] = df['Product'].map(freq)

print(df)
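The definition above also mentions proportions; the same example can map each product to its share of rows by passing normalize=True to value_counts:

import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})
prop = df['Product'].value_counts(normalize=True)  # A=0.50, B=0.33, C=0.17 (approx.)
df['Product_ProportionEncoded'] = df['Product'].map(prop)
print(df)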

 

Link to Data Preprocessing Home


 
