Encoding categorical variables during data preprocessing in machine learning and data science.
Encoding Categorical Variables
Mandatory when the dataset contains non-numeric features.
Why Do We Need to Encode Categorical Variables in Machine Learning?
Most machine learning algorithms can only process numerical input; they don't understand text or categories like "Red", "Dog", or "India". So we encode categorical variables to convert them into a numeric format that models can interpret and learn from.
Key Problems with Raw Categorical Data:
Problem | Why it's a Problem
Categorical data is not numeric | Algorithms require numeric operations
No natural order in text | "Blue" > "Red" makes no sense numerically
High cardinality | Increases model complexity if not handled properly
Risk of misleading the model | If text is converted to integers improperly
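To make the last point concrete, here is a minimal sketch (the color-to-integer mapping is purely illustrative) of how naive integer codes invent an order that does not exist:
Python Example:
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
# Naive integer codes make the model "see" Blue (2) > Green (1) > Red (0),
# an ordering that has no real-world meaning for colors
df['Color_AsInt'] = df['Color'].map({'Red': 0, 'Green': 1, 'Blue': 2})
print(df)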
When to Encode Categorical Variables?
Scenario | Encoding Technique | Why
Model cannot handle strings | Any encoding | Models like scikit-learn, XGBoost, etc. require numeric input
Ordinal categories (Low, Medium, High) | Label Encoding | There is an order that must be preserved
Nominal categories (Color, City) | One-Hot Encoding | There is no order; each value should be independent
High-cardinality features (Zip, ID, Product) | Target/Frequency Encoding | Reduces dimensionality compared to one-hot
Preparing for distance-based models (KNN) | One-Hot or Binary Encoding | Avoids false closeness from label encoding
Tree-based models (Random Forest, XGBoost) | Label or Target Encoding | Trees are not sensitive to monotonic encodings
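The table mentions Binary Encoding, which is not shown in the examples below. A minimal sketch, assuming the optional category_encoders package is installed (the City column is just illustrative):
Python Example:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi']})
# Binary encoding writes each category's integer code in binary digits,
# so it needs far fewer columns than one-hot for high-cardinality features
encoder = ce.BinaryEncoder(cols=['City'])
df_encoded = encoder.fit_transform(df)
print(df_encoded)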
Example: Why You Must Encode
from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Label': [1, 0, 1]
})

# This will raise an error:
model = LogisticRegression()
model.fit(df[['Color']], df['Label'])  # Error: cannot handle strings
Fix it with encoding:
df_encoded = pd.get_dummies(df, columns=['Color'])
model.fit(df_encoded.drop(columns=['Label']), df['Label'])  # Works fine
Why Encode? | When to Do It?
Models can't handle strings | Before training any non-NLP model
Maintain ordinal/numerical meaning | When feature has ranked categories
Reduce dimensionality | For features with many unique values
Avoid misleading the model | Always, when feeding categorical data
Different Encoding Techniques
- Label Encoding (Ordinal data) - Converts each category into an integer (e.g., Low = 0, Medium = 1, High = 2).
- When to use?
- When categories have a natural order (ordinal).
- Examples: ["Low", "Medium", "High"], ["Poor", "Average", "Good"].
Python Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
encoder = LabelEncoder()
df['Size_LabelEncoded'] = encoder.fit_transform(df['Size'])
print(df)
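One caveat: LabelEncoder assigns integers alphabetically (here Large = 0, Medium = 1, Small = 2), so the codes above do not actually follow the Small < Medium < Large ranking. A minimal sketch of one way to control the order, using an explicit mapping (sklearn's OrdinalEncoder with a categories argument is another option):
Python Example:
import pandas as pd

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
# Spell out the ranking so the integers reflect Small < Medium < Large
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
df['Size_OrdinalEncoded'] = df['Size'].map(size_order)
print(df)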
- One-Hot Encoding - Creates separate binary columns (0/1) for each category.
- When to use?
- For nominal data (no inherent order).
- Examples: ["Red", "Blue", "Green"], ["Dog", "Cat", "Bird"].
Python Example:
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
# One-hot encoding: one 0/1 column per color
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
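If the encoding needs to live inside a scikit-learn pipeline, sklearn's OneHotEncoder does the same job and can ignore categories that were not seen during training. A minimal sketch (the Color column is just illustrative):
Python Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
encoder = OneHotEncoder(handle_unknown='ignore')
# fit_transform returns a sparse matrix; convert it to a dense array to inspect
encoded = encoder.fit_transform(df[['Color']]).toarray()
print(encoder.categories_)
print(encoded)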
- Target Encoding (Mean Encoding) - Replaces each category with the mean of the target variable for that category.
- When to use?
- For high-cardinality features (e.g., hundreds of categories like ZIP codes).
- Must be used carefully to avoid data leakage (see the leakage-safe sketch after the example below).
Python Example:
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Mumbai'],
    'Sales': [100, 200, 150, 130, 160, 220]
})
# Target encoding: average sales per city
mean_encoded = df.groupby('City')['Sales'].mean()
df['City_TargetEncoded'] = df['City'].map(mean_encoded)
print(df)
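Because the encoding is computed from the target itself, the per-category means should be learned on the training rows only and then applied to validation/test rows. A minimal leakage-safe sketch (the train/test split below is purely illustrative):
Python Example:
import pandas as pd

df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Mumbai'],
    'Sales': [100, 200, 150, 130, 160, 220]
})
# Illustrative split: first four rows act as "train", last two as "test"
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()
# Learn the encoding from the training rows only
city_means = train.groupby('City')['Sales'].mean()
global_mean = train['Sales'].mean()
train['City_TargetEncoded'] = train['City'].map(city_means)
# Cities unseen in training fall back to the global training mean
test['City_TargetEncoded'] = test['City'].map(city_means).fillna(global_mean)
print(train)
print(test)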
- Frequency Encoding - Replaces each category with the frequency (count) or proportion.
- When to use?
- Also for high-cardinality categorical columns.
- Safer than target encoding (no leakage risk).
Python Example:
df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})
# Frequency encoding
freq = df['Product'].value_counts()
df['Product_FrequencyEncoded'] = df['Product'].map(freq)
print(df)
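The proportion variant mentioned above just divides each count by the number of rows; a minimal sketch using pandas' normalize option:
Python Example:
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})
# value_counts(normalize=True) returns each product's share of the rows instead of its count
prop = df['Product'].value_counts(normalize=True)
df['Product_ProportionEncoded'] = df['Product'].map(prop)
print(df)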
Link to Data Preprocessing Home