Part 14: Data Manipulation in Categorical Data Management
How Category Encoding and Label Handling Influence Bias and Model Stability

Machine learning models do not understand text. They work with numbers. When your dataset contains categories like product types, customer segments, or geographic regions, you face a fundamental challenge: converting these text labels into a format algorithms can process. Get this wrong and your model learns spurious patterns, fails to generalize, or crashes on new data.
Categorical data management is more than converting strings to numbers. It’s about preserving information, preventing bias, and ensuring your encoding scheme does not introduce artifacts that mislead your model. A poorly chosen encoding can make unrelated categories appear similar or create false ordinal relationships where none exist. Your model’s accuracy then suffers not because the algorithm is flawed but because the data representation is broken.
In this article, we’ll explore categorical data handling in pandas: creating categorical types, accessing category codes, reordering categories, adding new categories, and removing unused ones. Each operation impacts memory efficiency, model performance, and analytical correctness. Understanding these techniques helps you prepare data that leads to stable, unbiased models.
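To make the “false ordinal relationship” risk concrete, here is a minimal sketch (the region labels are illustrative): naively label-encoding a nominal column assigns arbitrary integers, and any distance-based model will read an ordering into them that does not exist.

```python
import pandas as pd

# A nominal column: the regions have no inherent order
regions = pd.Series(['North', 'South', 'East', 'West'], dtype='category')

codes = regions.cat.codes
print(codes.tolist())  # [1, 2, 0, 3] -- alphabetical category order, hence arbitrary

# A distance-based model fed these codes would treat West (3) as "farther"
# from East (0) than from South (2) -- a relationship that does not exist.
```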
Understanding Categorical Data Types
Pandas offers two ways to represent categorical data: as regular strings (object dtype) or as categorical dtype. The difference matters for both performance and functionality.
String representation stores each value as a separate string object. If you have a column with 1 million rows and only 5 unique categories, you’re storing 1 million separate string objects. This wastes memory and slows operations.
Categorical representation stores unique categories once and uses integer codes to reference them. Those same 1 million rows now store only 5 string objects plus 1 million small integers. Memory usage drops dramatically. Operations like groupby, merge, and filtering run faster.
Beyond efficiency, categorical dtype enables operations impossible with strings. You can define custom orderings, add categories without data, and ensure consistency across datasets. These capabilities are essential for proper model preparation.
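The memory gap described above is easy to verify yourself. The exact byte counts depend on your pandas version and platform, so treat the printed numbers as illustrative:

```python
import pandas as pd
import numpy as np

# One million rows drawn from just five labels
labels = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=1_000_000)

as_object = pd.Series(labels, dtype='object')      # 1M separate string objects
as_category = pd.Series(labels, dtype='category')  # 5 strings + 1M small int codes

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {obj_bytes:>12,} bytes")
print(f"category dtype: {cat_bytes:>12,} bytes")
print(f"reduction:      {1 - cat_bytes / obj_bytes:.0%}")
```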
1. Creating Categorical Data
Converting columns to categorical is straightforward, but the details matter. You can convert existing data or create categorical columns from scratch.
Basic Categorical Conversion
import pandas as pd
import numpy as np

# Create sample data with repeated categories
data = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Mouse', 'Laptop',
                'Monitor', 'Keyboard', 'Mouse', 'Laptop', 'Monitor'],
    'size': ['Large', 'Small', 'Medium', 'Small', 'Large',
             'Large', 'Medium', 'Small', 'Large', 'Large'],
    'price': [1200, 25, 75, 25, 1200, 300, 75, 25, 1200, 300]
})
print("Original Data:")
print(data)
print(f"\nData types:\n{data.dtypes}")
print(f"\nMemory usage:\n{data.memory_usage(deep=True)}")
Output:
Original Data:
product size price
0 Laptop Large 1200
1 Mouse Small 25
2 Keyboard Medium 75
3 Mouse Small 25
4 Laptop Large 1200
5 Monitor Large 300
6 Keyboard Medium 75
7 Mouse Small 25
8 Laptop Large 1200
9 Monitor Large 300
Data types:
product object
size object
price int64
dtype: object
Memory usage:
Index 132
product 730
size 650
price 80
dtype: int64
Converting to Categorical
# Convert to categorical
data_cat = data.copy()
data_cat['product'] = pd.Categorical(data_cat['product'])
data_cat['size'] = pd.Categorical(data_cat['size'])
print("\nAfter Categorical Conversion:")
print(data_cat)
print(f"\nData types:\n{data_cat.dtypes}")
print(f"\nMemory usage:\n{data_cat.memory_usage(deep=True)}")

# Calculate memory savings
original_memory = data.memory_usage(deep=True).sum()
categorical_memory = data_cat.memory_usage(deep=True).sum()
savings = (1 - categorical_memory / original_memory) * 100
print(f"\nMemory Savings: {savings:.1f}%")
Output:
After Categorical Conversion:
product size price
0 Laptop Large 1200
1 Mouse Small 25
2 Keyboard Medium 75
3 Mouse Small 25
4 Laptop Large 1200
5 Monitor Large 300
6 Keyboard Medium 75
7 Mouse Small 25
8 Laptop Large 1200
9 Monitor Large 300
Data types:
product category
size category
price int64
dtype: object
Memory usage:
Index 132
product 362
size 342
price 80
dtype: int64
Memory Savings: 42.1%
Specifying Categories Upfront
# Define categories explicitly
all_products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 'Phone']
product_cat = pd.Categorical(
    data['product'],
    categories=all_products,
    ordered=False
)
print("\nCategorical with Explicit Categories:")
print(product_cat)
print(f"\nAll categories: {product_cat.categories.tolist()}")
counts = product_cat.value_counts()
print(f"Used categories: {counts[counts > 0].index.tolist()}")
print(f"Unused categories: {set(all_products) - set(data['product'].unique())}")
Output:
Categorical with Explicit Categories:
['Laptop', 'Mouse', 'Keyboard', 'Mouse', 'Laptop', 'Monitor', 'Keyboard', 'Mouse', 'Laptop', 'Monitor']
Categories (6, object): ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 'Phone']
All categories: ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 'Phone']
Used categories: ['Laptop', 'Mouse', 'Monitor', 'Keyboard']
Unused categories: {'Phone', 'Tablet'}
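One caveat worth knowing when declaring categories upfront: any value that is not in the declared list is silently converted to NaN rather than raising an error. A minimal sketch (the 'Speaker' value is illustrative):

```python
import pandas as pd

# 'Speaker' is not among the declared categories, so it silently becomes NaN
cat = pd.Categorical(
    ['Laptop', 'Speaker', 'Mouse'],
    categories=['Laptop', 'Mouse', 'Keyboard']
)
print(cat.tolist())         # 'Speaker' has been replaced by nan
print(cat.isna().tolist())  # [False, True, False]
```

Always check for unexpected NaN after such a conversion; a typo in the data can otherwise vanish without a trace.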
Ordered Categories
# Create ordered categorical for size
size_order = ['Small', 'Medium', 'Large']
data_cat['size_ordered'] = pd.Categorical(
    data['size'],
    categories=size_order,
    ordered=True
)
print("\nOrdered Categorical (Size):")
print(data_cat[['size_ordered']])
print(f"\nIs ordered: {data_cat['size_ordered'].cat.ordered}")
print(f"\nCategories: {data_cat['size_ordered'].cat.categories.tolist()}")

# Comparison works with ordered categories
print("\nComparison operations:")
print(data_cat['size_ordered'] >= 'Medium')
Output:
Ordered Categorical (Size):
size_ordered
0 Large
1 Small
2 Medium
3 Small
4 Large
5 Large
6 Medium
7 Small
8 Large
9 Large
Is ordered: True
Categories: ['Small', 'Medium', 'Large']
Comparison operations:
0 True
1 False
2 True
3 False
4 True
5 True
6 True
7 False
8 True
9 True
Name: size_ordered, dtype: bool
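Ordered categoricals also change how sorting behaves: sort_values follows the declared category order rather than alphabetical order, and min/max become meaningful. A small sketch:

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(['Large', 'Small', 'Medium', 'Small'],
                   categories=['Small', 'Medium', 'Large'],
                   ordered=True)
)

# Sorting respects the declared order, not alphabetical order
print(sizes.sort_values().tolist())  # ['Small', 'Small', 'Medium', 'Large']
print(sizes.min(), sizes.max())      # Small Large
```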
2. Getting Categorical Codes
Under the hood, categorical data uses integer codes to represent categories. Accessing these codes is essential for understanding how your data will be interpreted by models and for implementing custom encoding schemes.
Basic Code Access
# Create categorical data
categories = pd.DataFrame({
    'department': pd.Categorical(['Sales', 'Engineering', 'Marketing',
                                  'Sales', 'Engineering', 'HR',
                                  'Marketing', 'Sales', 'HR', 'Engineering']),
    'performance': pd.Categorical(['Good', 'Excellent', 'Fair',
                                   'Excellent', 'Good', 'Fair',
                                   'Good', 'Fair', 'Excellent', 'Good'])
})
print("Categorical Data:")
print(categories)
print()

# Get codes
print("Department Codes:")
print(categories['department'].cat.codes)
print()
print("Performance Codes:")
print(categories['performance'].cat.codes)
print()

# Show category to code mapping
print("Department Category Mapping:")
dept_categories = categories['department'].cat.categories
for i, cat in enumerate(dept_categories):
    print(f"  {cat}: {i}")
Output:
Categorical Data:
department performance
0 Sales Good
1 Engineering Excellent
2 Marketing Fair
3 Sales Excellent
4 Engineering Good
5 HR Fair
6 Marketing Good
7 Sales Fair
8 HR Excellent
9 Engineering Good
Department Codes:
0 3
1 0
2 2
3 3
4 0
5 1
6 2
7 3
8 1
9 0
dtype: int8
Performance Codes:
0 1
1 0
2 2
3 0
4 1
5 2
6 1
7 2
8 0
9 1
dtype: int8
Department Category Mapping:
Engineering: 0
HR: 1
Marketing: 2
Sales: 3
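One detail that matters when feeding codes to a model: missing values are always encoded as -1, regardless of how many categories exist. Passed straight into an algorithm, -1 looks like a legitimate (and "smallest") category, so it is worth checking for explicitly. A minimal sketch:

```python
import pandas as pd

# A categorical Series with one missing value
s = pd.Series(['Sales', None, 'HR', 'Sales'], dtype='category')

codes = s.cat.codes
print(codes.tolist())               # [1, -1, 0, 1] -- categories are ['HR', 'Sales']
print(f"missing: {(codes == -1).sum()}")
```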
Using Codes for Analysis
# Create DataFrame with codes
categories_with_codes = categories.copy()
categories_with_codes['dept_code'] = categories['department'].cat.codes
categories_with_codes['perf_code'] = categories['performance'].cat.codes
print("\nData with Codes:")
print(categories_with_codes)
print()

# Group by codes
print("Average Performance Code by Department:")
avg_perf = categories_with_codes.groupby('department')['perf_code'].mean().round(2)
print(avg_perf)
print()

# Map codes back to categories
print("Code to Category Lookup:")
dept_lookup = dict(enumerate(categories['department'].cat.categories))
print(f"Department codes: {dept_lookup}")
Output:
Data with Codes:
department performance dept_code perf_code
0 Sales Good 3 1
1 Engineering Excellent 0 0
2 Marketing Fair 2 2
3 Sales Excellent 3 0
4 Engineering Good 0 1
5 HR Fair 1 2
6 Marketing Good 2 1
7 Sales Fair 3 2
8 HR Excellent 1 0
9 Engineering Good 0 1
Average Performance Code by Department:
department
Engineering 0.67
HR 1.00
Marketing 1.50
Sales 1.00
Name: perf_code, dtype: float64
Code to Category Lookup:
Department codes: {0: 'Engineering', 1: 'HR', 2: 'Marketing', 3: 'Sales'}
Codes vs Values
# Demonstrate difference between codes and values
sample_cat = pd.Categorical(['B', 'A', 'C', 'A', 'B'],
                            categories=['A', 'B', 'C'],
                            ordered=True)
print("Categorical Values:")
print(sample_cat)
print()
print("Codes (based on category order):")
print(sample_cat.codes)
print()
print("Category order determines codes:")
for i, cat in enumerate(sample_cat.categories):
    print(f"  '{cat}' -> code {i}")
Output:
Categorical Values:
['B', 'A', 'C', 'A', 'B']
Categories (3, object): ['A' < 'B' < 'C']
Codes (based on category order):
[1 0 2 0 1]
Category order determines codes:
'A' -> code 0
'B' -> code 1
'C' -> code 2
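A related operation is renaming: `rename_categories` changes the labels but leaves the codes untouched, which makes it safe for cosmetic cleanups, whereas reordering (the next section's topic) changes the code mapping itself. A minimal sketch with illustrative labels:

```python
import pandas as pd

grades = pd.Categorical(['B', 'A', 'C'], categories=['A', 'B', 'C'])
print(grades.codes)  # [1 0 2]

# Relabel without touching the underlying codes
relabeled = grades.rename_categories({'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'})
print(relabeled.tolist())  # ['Beta', 'Alpha', 'Gamma']
print(relabeled.codes)     # [1 0 2] -- unchanged
```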
3. Reordering Categories
The order of categories matters for ordered categoricals and affects how codes are assigned. Reordering changes the mapping between categories and their numeric representations.
Basic Reordering
# Create ordered categorical with initial order
ratings = pd.Categorical(
    ['Good', 'Poor', 'Excellent', 'Fair', 'Good', 'Poor', 'Excellent'],
    categories=['Poor', 'Fair', 'Good', 'Excellent'],
    ordered=True
)
print("Original Order:")
print(ratings)
print(f"Categories: {ratings.categories.tolist()}")
print(f"Codes: {ratings.codes}")
print()

# Reorder categories
ratings_reordered = ratings.reorder_categories(['Excellent', 'Good', 'Fair', 'Poor'])
print("After Reordering:")
print(ratings_reordered)
print(f"Categories: {ratings_reordered.categories.tolist()}")
print(f"Codes: {ratings_reordered.codes}")
Output:
Original Order:
['Good', 'Poor', 'Excellent', 'Fair', 'Good', 'Poor', 'Excellent']
Categories (4, object): ['Poor' < 'Fair' < 'Good' < 'Excellent']
Codes: [2 0 3 1 2 0 3]
After Reordering:
['Good', 'Poor', 'Excellent', 'Fair', 'Good', 'Poor', 'Excellent']
Categories (4, object): ['Excellent' < 'Good' < 'Fair' < 'Poor']
Codes: [1 3 0 2 1 3 0]
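Note that `reorder_categories` requires the new list to contain exactly the existing categories; it raises a ValueError if any are missing or added. When you need to reorder and add (or drop) categories in one step, `set_categories` is the more flexible tool. A small sketch:

```python
import pandas as pd

ratings = pd.Categorical(['Good', 'Poor'], categories=['Poor', 'Good'])

# reorder_categories must use exactly the same set of labels
try:
    ratings.reorder_categories(['Poor', 'Good', 'Excellent'])
except ValueError as exc:
    print(f"ValueError: {exc}")

# set_categories can reorder and extend at the same time
extended = ratings.set_categories(['Excellent', 'Good', 'Poor'])
print(extended.categories.tolist())  # ['Excellent', 'Good', 'Poor']
print(extended.tolist())             # values unchanged: ['Good', 'Poor']
```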
Practical Example: Priority Levels
# Create task priority data
tasks = pd.DataFrame({
    'task': ['Task A', 'Task B', 'Task C', 'Task D', 'Task E'],
    'priority': ['Medium', 'High', 'Low', 'High', 'Medium']
})

# Initial categorical (alphabetical)
tasks['priority_cat'] = pd.Categorical(tasks['priority'], ordered=True)
print("Initial Priority (Alphabetical Order):")
print(tasks)
print(f"Categories: {tasks['priority_cat'].cat.categories.tolist()}")
print(f"Codes: {tasks['priority_cat'].cat.codes.tolist()}")
print()

# Reorder to logical priority order
tasks['priority_cat'] = tasks['priority_cat'].cat.reorder_categories(
    ['Low', 'Medium', 'High'],
    ordered=True
)
print("After Reordering (Logical Priority):")
print(tasks)
print(f"Categories: {tasks['priority_cat'].cat.categories.tolist()}")
print(f"Codes: {tasks['priority_cat'].cat.codes.tolist()}")
print()

# Now comparisons work logically
print("High Priority Tasks (Priority >= 'Medium'):")
high_priority = tasks[tasks['priority_cat'] >= 'Medium']
print(high_priority)
Output:
Initial Priority (Alphabetical Order):
task priority priority_cat
0 Task A Medium Medium
1 Task B High High
2 Task C Low Low
3 Task D High High
4 Task E Medium Medium
Categories: ['High', 'Low', 'Medium']
Codes: [2, 0, 1, 0, 2]
After Reordering (Logical Priority):
task priority priority_cat
0 Task A Medium Medium
1 Task B High High
2 Task C Low Low
3 Task D High High
4 Task E Medium Medium
Categories: ['Low', 'Medium', 'High']
Codes: [1, 2, 0, 2, 1]
High Priority Tasks (Priority >= 'Medium'):
task priority priority_cat
0 Task A Medium Medium
1 Task B High High
3 Task D High High
4 Task E Medium Medium
Reordering for Analysis
# Create survey response data
responses = pd.DataFrame({
    'question': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
    'response': ['Agree', 'Disagree', 'Strongly Agree',
                 'Neutral', 'Agree', 'Strongly Disagree']
})

# Convert to ordered categorical with proper order
response_order = ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']
responses['response_cat'] = pd.Categorical(
    responses['response'],
    categories=response_order,
    ordered=True
)
print("Survey Responses:")
print(responses)
print()

# Calculate numeric scores based on order
responses['score'] = responses['response_cat'].cat.codes
print("With Numeric Scores:")
print(responses)
print()

# Calculate average by question
print("Average Score by Question:")
avg_scores = responses.groupby('question')['score'].mean()
print(avg_scores)
Output:
Survey Responses:
question response response_cat
0 Q1 Agree Agree
1 Q1 Disagree Disagree
2 Q1 Strongly Agree Strongly Agree
3 Q2 Neutral Neutral
4 Q2 Agree Agree
5 Q2 Strongly Disagree Strongly Disagree
With Numeric Scores:
question response response_cat score
0 Q1 Agree Agree 3
1 Q1 Disagree Disagree 1
2 Q1 Strongly Agree Strongly Agree 4
3 Q2 Neutral Neutral 2
4 Q2 Agree Agree 3
5 Q2 Strongly Disagree Strongly Disagree 0
Average Score by Question:
question
Q1 2.666667
Q2 1.666667
Name: score, dtype: float64
4. Adding New Categories
When working with categorical data, you may need to add categories that don’t exist in your current dataset. This is essential for ensuring consistency across datasets or preparing for future values.
Adding Categories Without Data
# Create initial categorical
products = pd.Categorical(['Laptop', 'Mouse', 'Keyboard', 'Laptop'])
print("Initial Categories:")
print(f"Categories: {products.categories.tolist()}")
print(f"Values: {products.tolist()}")
print()
# Add new category
products_extended = products.add_categories('Monitor')
print("After Adding 'Monitor':")
print(f"Categories: {products_extended.categories.tolist()}")
print(f"Values: {products_extended.tolist()}")
print()
# Add multiple categories (applied to the original `products`, so 'Monitor' is not included)
products_extended = products.add_categories(['Tablet', 'Phone'])
print("After Adding Multiple Categories:")
print(f"Categories: {products_extended.categories.tolist()}")
Output:
Initial Categories:
Categories: ['Keyboard', 'Laptop', 'Mouse']
Values: ['Laptop', 'Mouse', 'Keyboard', 'Laptop']
After Adding 'Monitor':
Categories: ['Keyboard', 'Laptop', 'Mouse', 'Monitor']
Values: ['Laptop', 'Mouse', 'Keyboard', 'Laptop']
After Adding Multiple Categories:
Categories: ['Keyboard', 'Laptop', 'Mouse', 'Tablet', 'Phone']
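The reason add_categories matters in practice: assigning a value that is not among the declared categories raises an error rather than silently extending them. A minimal sketch of the failure mode (the exact exception type and message vary by pandas version, hence the broad except):

```python
import pandas as pd

products = pd.Series(pd.Categorical(['Laptop', 'Mouse']))

# Assigning an unknown category fails...
try:
    products[0] = 'Monitor'
except (TypeError, ValueError) as exc:
    print(f"Rejected: {exc}")

# ...until the category is registered first
products = products.cat.add_categories('Monitor')
products[0] = 'Monitor'
print(products.tolist())  # ['Monitor', 'Mouse']
```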
Using Added Categories
# Create DataFrame with categorical column
df = pd.DataFrame({
    'product': pd.Categorical(['Laptop', 'Mouse', 'Keyboard'])
})
print("Original Data:")
print(df)
print(f"Categories: {df['product'].cat.categories.tolist()}")
print()

# Add new category
df['product'] = df['product'].cat.add_categories('Monitor')
print("After Adding 'Monitor' Category:")
print(f"Categories: {df['product'].cat.categories.tolist()}")
print()

# Now we can use the new category
df.loc[3] = 'Monitor'
print("After Adding Monitor Value:")
print(df)
Output:
Original Data:
product
0 Laptop
1 Mouse
2 Keyboard
Categories: ['Keyboard', 'Laptop', 'Mouse']
After Adding 'Monitor' Category:
Categories: ['Keyboard', 'Laptop', 'Mouse', 'Monitor']
After Adding Monitor Value:
product
0 Laptop
1 Mouse
2 Keyboard
3 Monitor
Preventing Errors with Pre-defined Categories
# Example: Preparing for merge operations
sales_q1 = pd.DataFrame({
    'product': pd.Categorical(['Laptop', 'Mouse']),
    'sales': [10, 50]
})
sales_q2 = pd.DataFrame({
    'product': pd.Categorical(['Keyboard', 'Monitor']),
    'sales': [30, 20]
})
print("Q1 Sales:")
print(sales_q1)
print(f"Categories: {sales_q1['product'].cat.categories.tolist()}")
print()
print("Q2 Sales:")
print(sales_q2)
print(f"Categories: {sales_q2['product'].cat.categories.tolist()}")
print()
# Define all possible products
all_products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet']

# Give both frames identical categories in identical order. This matters:
# pd.concat only preserves the categorical dtype when the categories of the
# inputs match exactly (including order); otherwise the result falls back
# to object dtype.
sales_q1['product'] = sales_q1['product'].cat.set_categories(all_products)
sales_q2['product'] = sales_q2['product'].cat.set_categories(all_products)
print("After Standardizing Categories:")
print(f"Q1 Categories: {sales_q1['product'].cat.categories.tolist()}")
print(f"Q2 Categories: {sales_q2['product'].cat.categories.tolist()}")
print()

# Now concatenation preserves the categorical dtype
combined = pd.concat([sales_q1, sales_q2], ignore_index=True)
print("Combined Sales:")
print(combined)
Output:
Q1 Sales:
product sales
0 Laptop 10
1 Mouse 50
Categories: ['Laptop', 'Mouse']
Q2 Sales:
product sales
0 Keyboard 30
1 Monitor 20
Categories: ['Keyboard', 'Monitor']
After Standardizing Categories:
Q1 Categories: ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet']
Q2 Categories: ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet']
Combined Sales:
product sales
0 Laptop 10
1 Mouse 50
2 Keyboard 30
3 Monitor 20
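For exactly this situation, pandas also ships a helper: `union_categoricals` in `pandas.api.types` merges categoricals with different category sets without any manual bookkeeping. A sketch of the same Q1/Q2 merge using it:

```python
import pandas as pd
from pandas.api.types import union_categoricals

q1 = pd.Categorical(['Laptop', 'Mouse'])
q2 = pd.Categorical(['Keyboard', 'Monitor'])

# Concatenates the values and takes the union of both category sets
combined = union_categoricals([q1, q2])
print(combined.tolist())             # ['Laptop', 'Mouse', 'Keyboard', 'Monitor']
print(combined.categories.tolist())  # union of both category sets
```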
5. Removing Unused Categories
Over time, categorical columns may accumulate unused categories from operations like filtering or merging. These unused categories can cause issues with analysis and modeling. Removing them keeps your data clean and prevents confusion.
Identifying and Removing Unused Categories
# Create data with unused categories
all_sizes = ['XS', 'S', 'M', 'L', 'XL', 'XXL']
inventory = pd.DataFrame({
    'item': ['Shirt A', 'Shirt B', 'Shirt C', 'Shirt D'],
    'size': pd.Categorical(['M', 'L', 'M', 'L'], categories=all_sizes)
})
print("Inventory Data:")
print(inventory)
print(f"\nAll Categories: {inventory['size'].cat.categories.tolist()}")
print(f"Used Categories: {inventory['size'].unique().tolist()}")
print(f"Value Counts:\n{inventory['size'].value_counts()}")
print()

# Remove unused categories
inventory['size_clean'] = inventory['size'].cat.remove_unused_categories()
print("After Removing Unused Categories:")
print(f"All Categories: {inventory['size_clean'].cat.categories.tolist()}")
print(f"Value Counts:\n{inventory['size_clean'].value_counts()}")
Output:
Inventory Data:
item size
0 Shirt A M
1 Shirt B L
2 Shirt C M
3 Shirt D L
All Categories: ['XS', 'S', 'M', 'L', 'XL', 'XXL']
Used Categories: ['M', 'L']
Value Counts:
M 2
L 2
S 0
XS 0
XL 0
XXL 0
Name: size, dtype: int64
After Removing Unused Categories:
All Categories: ['M', 'L']
Value Counts:
M 2
L 2
Name: size, dtype: int64
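An alternative to removing unused categories before every aggregation is the `observed` argument to groupby, which restricts the result to categories that actually occur without modifying the column itself. A small sketch (the quantities are illustrative):

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(['M', 'L', 'M'],
                                 categories=['S', 'M', 'L', 'XL']))
df = pd.DataFrame({'size': sizes, 'qty': [2, 1, 4]})

# observed=False keeps every declared category, including empty ones
all_groups = df.groupby('size', observed=False)['qty'].sum()
print(all_groups)   # S and XL appear with 0

# observed=True drops categories with no rows
seen_groups = df.groupby('size', observed=True)['qty'].sum()
print(seen_groups)  # only M and L
```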
Practical Example: After Filtering
# Create full product catalog
catalog = pd.DataFrame({
    'product': pd.Categorical(['Laptop', 'Mouse', 'Keyboard', 'Monitor',
                               'Tablet', 'Phone', 'Headphones', 'Webcam']),
    'price': [1200, 25, 75, 300, 500, 800, 150, 80],
    'in_stock': [True, True, True, True, False, False, True, False]
})
print("Full Catalog:")
print(catalog)
print(f"Product Categories: {catalog['product'].cat.categories.tolist()}")
print()

# Filter to in-stock items
in_stock = catalog[catalog['in_stock']].copy()
print("In-Stock Items (Before Cleanup):")
print(in_stock)
print(f"Product Categories: {in_stock['product'].cat.categories.tolist()}")
print()

# Remove unused categories
in_stock['product'] = in_stock['product'].cat.remove_unused_categories()
print("In-Stock Items (After Cleanup):")
print(in_stock)
print(f"Product Categories: {in_stock['product'].cat.categories.tolist()}")
Output:
Full Catalog:
product price in_stock
0 Laptop 1200 True
1 Mouse 25 True
2 Keyboard 75 True
3 Monitor 300 True
4 Tablet 500 False
5 Phone 800 False
6 Headphones 150 True
7 Webcam 80 False
Product Categories: ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 'Phone', 'Headphones', 'Webcam']
In-Stock Items (Before Cleanup):
product price in_stock
0 Laptop 1200 True
1 Mouse 25 True
2 Keyboard 75 True
3 Monitor 300 True
6 Headphones 150 True
Product Categories: ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 'Phone', 'Headphones', 'Webcam']
In-Stock Items (After Cleanup):
product price in_stock
0 Laptop 1200 True
1 Mouse 25 True
2 Keyboard 75 True
3 Monitor 300 True
6 Headphones 150 True
Product Categories: ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones']
Why Removing Unused Categories Matters
# Demonstrate impact on operations
# Create data with unused categories
data = pd.DataFrame({
    'category': pd.Categorical(['A', 'B', 'A', 'B'],
                               categories=['A', 'B', 'C', 'D', 'E'])
})
print("Data with Unused Categories:")
print(data)
print()
print("Value Counts (includes unused):")
print(data['category'].value_counts())
print()

# After removing unused
data['category_clean'] = data['category'].cat.remove_unused_categories()
print("Value Counts (after cleanup):")
print(data['category_clean'].value_counts())
print()

# Impact on dummy variables
print("Dummy Variables (before cleanup):")
dummies_before = pd.get_dummies(data['category'], prefix='cat')
print(dummies_before)
print(f"Shape: {dummies_before.shape}")
print()
print("Dummy Variables (after cleanup):")
dummies_after = pd.get_dummies(data['category_clean'], prefix='cat')
print(dummies_after)
print(f"Shape: {dummies_after.shape}")
Output:
Data with Unused Categories:
category
0 A
1 B
2 A
3 B
Value Counts (includes unused):
A 2
B 2
C 0
D 0
E 0
Name: category, dtype: int64
Value Counts (after cleanup):
A 2
B 2
Name: category_clean, dtype: int64
Dummy Variables (before cleanup):
cat_A cat_B cat_C cat_D cat_E
0 1 0 0 0 0
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
Shape: (4, 5)
Dummy Variables (after cleanup):
cat_A cat_B
0 1 0
1 0 1
2 1 0
3 0 1
Shape: (4, 2)
Complete End-to-End Example: Customer Segmentation Pipeline
Here’s a comprehensive program demonstrating all categorical operations in a realistic customer analysis scenario.
import pandas as pd
import numpy as np

print("="*70)
print("CUSTOMER SEGMENTATION - CATEGORICAL DATA MANAGEMENT")
print("="*70)

# STEP 1: Create customer data with categorical variables
print("\n" + "="*70)
print("STEP 1: LOAD CUSTOMER DATA")
print("="*70)
np.random.seed(42)

# Generate sample customer data
customer_data = pd.DataFrame({
    'customer_id': range(1, 21),
    'segment': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], 20),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 20),
    'product_interest': np.random.choice(['Electronics', 'Clothing', 'Home'], 20),
    'satisfaction': np.random.choice(['Very Satisfied', 'Satisfied', 'Neutral',
                                      'Dissatisfied'], 20),
    'purchase_amount': np.random.randint(100, 1000, 20)
})
print("\nOriginal Customer Data (first 10):")
print(customer_data.head(10))
print(f"\nData Types:\n{customer_data.dtypes}")
print(f"\nMemory Usage: {customer_data.memory_usage(deep=True).sum()} bytes")
# STEP 2: Convert to categorical
print("\n" + "="*70)
print("STEP 2: CONVERT TO CATEGORICAL")
print("="*70)

# Define all possible categories
all_segments = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond']
all_regions = ['North', 'South', 'East', 'West']
all_products = ['Electronics', 'Clothing', 'Home', 'Sports', 'Books']
satisfaction_levels = ['Very Dissatisfied', 'Dissatisfied', 'Neutral',
                       'Satisfied', 'Very Satisfied']

# Convert with explicit categories
customer_data['segment_cat'] = pd.Categorical(
    customer_data['segment'],
    categories=all_segments,
    ordered=True
)
customer_data['region_cat'] = pd.Categorical(
    customer_data['region'],
    categories=all_regions,
    ordered=False
)
customer_data['product_cat'] = pd.Categorical(
    customer_data['product_interest'],
    categories=all_products,
    ordered=False
)
customer_data['satisfaction_cat'] = pd.Categorical(
    customer_data['satisfaction'],
    categories=satisfaction_levels,
    ordered=True
)
print("\nAfter Categorical Conversion:")
print(customer_data.dtypes)
print(f"\nMemory Usage: {customer_data.memory_usage(deep=True).sum()} bytes")

# Compare against an all-string version of the same (widened) frame
memory_saved = (1 - customer_data.memory_usage(deep=True).sum() /
                customer_data.astype(str).memory_usage(deep=True).sum()) * 100
print(f"Memory Savings: {memory_saved:.1f}%")
# STEP 3: Access categorical codes
print("\n" + "="*70)
print("STEP 3: EXAMINE CATEGORICAL CODES")
print("="*70)
print("\nSegment Codes:")
segment_codes = customer_data[['segment_cat']].copy()
segment_codes['code'] = customer_data['segment_cat'].cat.codes
print(segment_codes.head(10))
print("\nSegment Code Mapping:")
for i, seg in enumerate(customer_data['segment_cat'].cat.categories):
    count = (customer_data['segment_cat'].cat.codes == i).sum()
    print(f"  {seg} -> {i} (count: {count})")

# STEP 4: Reorder satisfaction for logical ordering
print("\n" + "="*70)
print("STEP 4: REORDER CATEGORIES")
print("="*70)
print("Satisfaction Before Reorder:")
print(f"Categories: {customer_data['satisfaction_cat'].cat.categories.tolist()}")
print(f"Sample codes: {customer_data['satisfaction_cat'].cat.codes[:5].tolist()}")

# Already in correct order, but demonstrate the concept
print("\nSatisfaction Order (Low to High):")
for i, level in enumerate(satisfaction_levels):
    print(f"  {i}: {level}")

# Create numeric scores based on satisfaction
customer_data['satisfaction_score'] = customer_data['satisfaction_cat'].cat.codes
print("\nCustomer Data with Satisfaction Scores:")
print(customer_data[['customer_id', 'satisfaction_cat', 'satisfaction_score']].head(10))
# STEP 5: Identify and analyze category usage
print("\n" + "="*70)
print("STEP 5: ANALYZE CATEGORY USAGE")
print("="*70)
print("Segment Distribution:")
print(customer_data['segment_cat'].value_counts())
print(f"\nUnused segments: {set(all_segments) - set(customer_data['segment'].unique())}")
print("\nProduct Interest Distribution:")
print(customer_data['product_cat'].value_counts())
print(f"\nUnused products: {set(all_products) - set(customer_data['product_interest'].unique())}")

# STEP 6: Filter data and remove unused categories
print("\n" + "="*70)
print("STEP 6: FILTER AND CLEANUP")
print("="*70)

# Filter to high-value customers (Satisfied+)
high_satisfaction = customer_data[
    customer_data['satisfaction_cat'] >= 'Satisfied'
].copy()
print(f"\nHigh Satisfaction Customers: {len(high_satisfaction)}")
print("\nSatisfaction Categories (before cleanup):")
print(high_satisfaction['satisfaction_cat'].cat.categories.tolist())
print("\nValue Counts:")
print(high_satisfaction['satisfaction_cat'].value_counts())

# Remove unused categories
high_satisfaction['satisfaction_cat'] = (
    high_satisfaction['satisfaction_cat'].cat.remove_unused_categories()
)
print("\nSatisfaction Categories (after cleanup):")
print(high_satisfaction['satisfaction_cat'].cat.categories.tolist())
# STEP 7: Add new category for future use
print("\n" + "="*70)
print("STEP 7: ADD NEW CATEGORIES")
print("="*70)
print("Current Segments:")
print(customer_data['segment_cat'].cat.categories.tolist())

# Check if Diamond is already in categories
if 'Diamond' not in customer_data['segment_cat'].cat.categories:
    customer_data['segment_cat'] = customer_data['segment_cat'].cat.add_categories(['Diamond'])
print("\nAfter Adding 'Diamond' Tier:")
print(customer_data['segment_cat'].cat.categories.tolist())

# Simulate upgrade
customer_data.loc[customer_data['purchase_amount'] > 900, 'segment_cat'] = 'Diamond'
print("\nSegment Distribution After Upgrades:")
print(customer_data['segment_cat'].value_counts())

# STEP 8: Create dummy variables for modeling
print("\n" + "="*70)
print("STEP 8: CREATE DUMMY VARIABLES")
print("="*70)

# Before cleanup - includes all categories
dummies_all = pd.get_dummies(customer_data['segment_cat'], prefix='segment')
print(f"Dummy variables (all categories): {dummies_all.shape}")
print(dummies_all.head())

# After removing unused
customer_data['segment_clean'] = (
    customer_data['segment_cat'].cat.remove_unused_categories()
)
dummies_clean = pd.get_dummies(customer_data['segment_clean'], prefix='segment')
print(f"\nDummy variables (used categories): {dummies_clean.shape}")
print(dummies_clean.head())
# STEP 9: Aggregate analysis by categories
print("\n" + "="*70)
print("STEP 9: AGGREGATE ANALYSIS")
print("="*70)
print("Average Purchase Amount by Segment:")
segment_stats = customer_data.groupby('segment_cat')['purchase_amount'].agg([
    ('count', 'count'),
    ('mean', 'mean'),
    ('total', 'sum')
]).round(2)
print(segment_stats)
print("\nAverage Satisfaction Score by Region:")
region_satisfaction = customer_data.groupby('region_cat')['satisfaction_score'].mean().round(2)
print(region_satisfaction)
print("\nCustomer Count by Product Interest and Segment:")
cross_tab = pd.crosstab(
    customer_data['product_cat'],
    customer_data['segment_cat'],
    margins=True
)
print(cross_tab)

# STEP 10: Model preparation summary
print("\n" + "="*70)
print("STEP 10: MODEL PREPARATION SUMMARY")
print("="*70)

# Create feature matrix
features = customer_data[['segment_clean', 'region_cat', 'product_cat',
                          'satisfaction_score', 'purchase_amount']].copy()

# Get dummies for all categorical variables
features_encoded = pd.get_dummies(
    features,
    columns=['segment_clean', 'region_cat', 'product_cat'],
    drop_first=True
)
print("Encoded Feature Matrix:")
print(features_encoded.head())
print(f"\nShape: {features_encoded.shape}")
print(f"\nFeature Names: {features_encoded.columns.tolist()}")
print("\nData Ready for Modeling:")
print(f"  Total Customers: {len(features_encoded)}")
print(f"  Total Features: {features_encoded.shape[1]}")
print("  Categorical Variables Encoded: 3")
print("  Numerical Features: 2")
print("\n" + "="*70)
print("ANALYSIS COMPLETE")
print("="*70)
Complete Output:
======================================================================
CUSTOMER SEGMENTATION - CATEGORICAL DATA MANAGEMENT
======================================================================
======================================================================
STEP 1: LOAD CUSTOMER DATA
======================================================================
Original Customer Data (first 10):
customer_id segment region product_interest satisfaction purchase_amount
0 1 Silver North Electronics Very Satisfied 474
1 2 Gold East Clothing Very Satisfied 832
2 3 Silver South Clothing Neutral 211
3 4 Platinum South Home Dissatisfied 858
4 5 Bronze East Electronics Satisfied 562
5 6 Silver North Home Very Satisfied 748
6 7 Gold South Clothing Neutral 366
7 8 Silver North Electronics Very Satisfied 832
8 9 Platinum West Clothing Satisfied 252
9 10 Bronze North Home Neutral 861
Data Types:
customer_id int64
segment object
region object
product_interest object
satisfaction object
purchase_amount int64
dtype: object
Memory Usage: 3652 bytes
======================================================================
STEP 2: CONVERT TO CATEGORICAL
======================================================================
After Categorical Conversion:
customer_id int64
segment object
region object
product_interest object
satisfaction object
purchase_amount int64
segment_cat category
region_cat category
product_cat category
satisfaction_cat category
dtype: object
Memory Usage: 4548 bytes
Memory Savings: 37.4%
======================================================================
STEP 3: EXAMINE CATEGORICAL CODES
======================================================================
Segment Codes:
segment_cat code
0 Silver 1
1 Gold 2
2 Silver 1
3 Platinum 3
4 Bronze 0
5 Silver 1
6 Gold 2
7 Silver 1
8 Platinum 3
9 Bronze 0
Segment Code Mapping:
Bronze -> 0 (count: 6)
Silver -> 1 (count: 6)
Gold -> 2 (count: 3)
Platinum -> 3 (count: 5)
Diamond -> 4 (count: 0)
======================================================================
STEP 4: REORDER CATEGORIES
======================================================================
Satisfaction Before Reorder:
Categories: ['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
Sample codes: [4, 4, 2, 1, 3]
Satisfaction Order (Low to High):
0: Very Dissatisfied
1: Dissatisfied
2: Neutral
3: Satisfied
4: Very Satisfied
Customer Data with Satisfaction Scores:
customer_id satisfaction_cat satisfaction_score
0 1 Very Satisfied 4
1 2 Very Satisfied 4
2 3 Neutral 2
3 4 Dissatisfied 1
4 5 Satisfied 3
5 6 Very Satisfied 4
6 7 Neutral 2
7 8 Very Satisfied 4
8 9 Satisfied 3
9 10 Neutral 2
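A sketch of the reorder step. By default pandas sorts categories alphabetically, which scrambles an ordinal scale; `set_categories` with `ordered=True` restores the real ranking so the codes double as scores:

```python
import pandas as pd

sat = pd.Series(
    ["Very Satisfied", "Neutral", "Dissatisfied", "Satisfied"],
    dtype="category",
)
# Default categories are alphabetical -- not the real ranking
print(sat.cat.categories.tolist())

order = ["Very Dissatisfied", "Dissatisfied", "Neutral",
         "Satisfied", "Very Satisfied"]
sat = sat.cat.set_categories(order, ordered=True)

# With ordered categories, codes are usable as satisfaction scores
print(sat.cat.codes.tolist())  # [4, 2, 1, 3]
```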
======================================================================
STEP 5: ANALYZE CATEGORY USAGE
======================================================================
Segment Distribution:
Silver 6
Bronze 6
Platinum 5
Gold 3
Diamond 0
Name: segment_cat, dtype: int64
Unused segments: {'Diamond'}
Product Interest Distribution:
Clothing 8
Electronics 7
Home 5
Sports 0
Books 0
Name: product_cat, dtype: int64
Unused products: {'Sports', 'Books'}
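A sketch of the usage analysis above; the category lists are taken from the output, the sample rows are assumptions:

```python
import pandas as pd

product_dtype = pd.CategoricalDtype(
    ["Clothing", "Electronics", "Home", "Sports", "Books"]
)
products = pd.Series(
    ["Clothing", "Home", "Electronics", "Clothing"], dtype=product_dtype
)

# value_counts reports every defined category, including zero-count ones
print(products.value_counts())

# Unused = defined categories minus categories actually observed
unused = set(products.cat.categories) - set(products.unique())
print(unused)  # {'Sports', 'Books'}
```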
======================================================================
STEP 6: FILTER AND CLEANUP
======================================================================
High Satisfaction Customers: 12
Satisfaction Categories (before cleanup):
['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
Value Counts:
Very Satisfied 7
Satisfied 5
Dissatisfied 0
Neutral 0
Very Dissatisfied 0
Name: satisfaction_cat, dtype: int64
Satisfaction Categories (after cleanup):
['Satisfied', 'Very Satisfied']
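A sketch of the filter-then-cleanup step. Filtering keeps the full category list until you explicitly drop the unused levels, which is exactly the before/after shown above:

```python
import pandas as pd

order = ["Very Dissatisfied", "Dissatisfied", "Neutral",
         "Satisfied", "Very Satisfied"]
sat = pd.Series(
    ["Very Satisfied", "Neutral", "Satisfied", "Dissatisfied"],
    dtype=pd.CategoricalDtype(order, ordered=True),
)

# Ordered categoricals support comparison operators directly
high = sat[sat >= "Satisfied"]

print(high.cat.categories.tolist())  # still all five levels
high = high.cat.remove_unused_categories()
print(high.cat.categories.tolist())  # ['Satisfied', 'Very Satisfied']
```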
======================================================================
STEP 7: ADD NEW CATEGORIES
======================================================================
Current Segments:
['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond']
After Adding 'Diamond' Tier:
['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond']
Segment Distribution After Upgrades:
Silver 6
Bronze 6
Platinum 4
Gold 3
Diamond 1
Name: segment_cat, dtype: int64
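A sketch of the add-category step. Note that pandas raises a `ValueError` if the category already exists (as Diamond does in the output above), so it's worth guarding the call:

```python
import pandas as pd

seg = pd.Series(["Bronze", "Silver", "Gold", "Platinum"], dtype="category")

# add_categories raises on duplicates, so check membership first
if "Diamond" not in seg.cat.categories:
    seg = seg.cat.add_categories(["Diamond"])

print(seg.cat.categories.tolist())

# Rows can now be upgraded to the new tier without a dtype error
seg.iloc[3] = "Diamond"
print(seg.value_counts().to_dict())
```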
======================================================================
STEP 8: CREATE DUMMY VARIABLES
======================================================================
Dummy variables (all categories): (20, 5)
segment_Bronze segment_Silver segment_Gold segment_Platinum segment_Diamond
0 0 1 0 0 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
Dummy variables (used categories): (20, 5)
segment_Bronze segment_Silver segment_Gold segment_Platinum segment_Diamond
0 0 1 0 0 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
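A sketch of why the dummy matrix includes a Diamond column even when no row displayed above carries it: `get_dummies` on a categorical column emits one column per *defined* category, not per observed value (the sample values here are assumptions):

```python
import pandas as pd

seg = pd.Series(
    ["Silver", "Gold", "Silver", "Platinum", "Bronze"],
    dtype=pd.CategoricalDtype(
        ["Bronze", "Silver", "Gold", "Platinum", "Diamond"]
    ),
)

dummies = pd.get_dummies(seg, prefix="segment")
print(dummies.columns.tolist())
# ['segment_Bronze', 'segment_Silver', 'segment_Gold',
#  'segment_Platinum', 'segment_Diamond']

# Dropping unused categories first removes the all-zero Diamond column
used = pd.get_dummies(seg.cat.remove_unused_categories(), prefix="segment")
print(used.columns.tolist())
```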
======================================================================
STEP 9: AGGREGATE ANALYSIS
======================================================================
Average Purchase Amount by Segment:
count mean total
segment_cat
Bronze 6.0 510.83 3065.0
Silver 6.0 503.50 3021.0
Gold 3.0 489.67 1469.0
Platinum 4.0 551.50 2206.0
Diamond 1.0 944.00 944.0
Average Satisfaction Score by Region:
region_cat
North 2.86
South 2.00
East 3.17
West 2.50
Name: satisfaction_score, dtype: float64
Customer Count by Product Interest and Segment:
segment_cat Bronze Gold Platinum Silver Diamond All
product_cat
Clothing 3 1 2 2 0 8
Electronics 2 2 1 2 0 7
Home 1 0 1 2 1 5
All 6 3 4 6 1 20
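A sketch of the aggregation step; the values are assumptions. With a categorical grouper, `groupby` includes empty categories unless you pass `observed=True`, which is why zero-count tiers can appear in summaries like the ones above:

```python
import pandas as pd

df = pd.DataFrame({
    "segment": pd.Categorical(
        ["Bronze", "Silver", "Bronze", "Gold"],
        categories=["Bronze", "Silver", "Gold", "Platinum", "Diamond"],
    ),
    "purchase_amount": [500, 300, 520, 450],
})

# observed=False keeps every defined category in the result
summary = df.groupby("segment", observed=False)["purchase_amount"].agg(
    ["count", "mean", "sum"]
)
print(summary)  # Platinum and Diamond appear with count 0
```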
======================================================================
STEP 10: MODEL PREPARATION SUMMARY
======================================================================
Encoded Feature Matrix:
satisfaction_score purchase_amount segment_clean_Silver segment_clean_Gold segment_clean_Platinum segment_clean_Diamond region_cat_South region_cat_East region_cat_West product_cat_Electronics product_cat_Home
0 4 474 1 0 0 0 0 0 0 1 0
1 4 832 0 1 0 0 0 1 0 0 0
2 2 211 1 0 0 0 1 0 0 0 0
3 1 858 0 0 1 0 1 0 0 0 1
4 3 562 0 0 0 0 0 1 0 1 0
Shape: (20, 11)
Feature Names: ['satisfaction_score', 'purchase_amount', 'segment_clean_Silver', 'segment_clean_Gold', 'segment_clean_Platinum', 'segment_clean_Diamond', 'region_cat_South', 'region_cat_East', 'region_cat_West', 'product_cat_Electronics', 'product_cat_Home']
Data Ready for Modeling:
Total Customers: 20
Total Features: 11
Categorical Variables Encoded: 3
Numerical Features: 2
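A sketch of assembling a feature matrix like the one above (column names and values are assumptions). `drop_first=True` drops one level per variable -- here Bronze and North become the implicit baselines, matching the encoded matrix shown, and avoiding redundant, collinear dummy columns:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_amount": [474, 832, 211],
    "segment": pd.Categorical(
        ["Silver", "Gold", "Silver"],
        categories=["Bronze", "Silver", "Gold", "Platinum"],
    ),
    "region": pd.Categorical(
        ["North", "East", "South"],
        categories=["North", "South", "East", "West"],
    ),
})

# One-hot encode the categoricals, keeping numeric columns as-is
X = pd.get_dummies(df, columns=["segment", "region"], drop_first=True)
print(X.shape)
print(X.columns.tolist())
```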
======================================================================
ANALYSIS COMPLETE
======================================================================
How Category Encoding Influences Model Behavior
The way you encode categorical variables directly impacts what patterns your model can learn and how well it generalizes.
Label Encoding Creates False Orderings
Simply converting categories to integers (Bronze=0, Silver=1, Gold=2, Platinum=3) implies an arithmetic relationship. The model treats the gap between tiers as a uniform numeric distance, as if Platinum were "three units more" than Bronze, which may not reflect reality. For truly nominal categories like product types or regions, this false ordering introduces bias.
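A minimal illustration of the problem. With nominal data, `.cat.codes` assigns integers in alphabetical order, so the arithmetic between codes is meaningless:

```python
import pandas as pd

region = pd.Series(["North", "South", "East", "West"], dtype="category")

# Codes follow alphabetical category order: East=0, North=1, South=2, West=3
print(dict(zip(region, region.cat.codes)))
# A linear model fed these codes would treat West (3) as "three steps
# beyond" East (0), an ordering that does not exist geographically.
```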
One-Hot Encoding Prevents Bias
Creating binary columns for each category (dummy variables) treats categories as independent. No false arithmetic relationships exist. This is appropriate for nominal variables but increases dimensionality.
Ordered Categories Enable Comparison
For truly ordered categories like satisfaction levels or priority tiers, maintaining order lets you use comparison operators and enables ordinal encoding where the numeric codes reflect actual ranking.
Unused Categories Waste Resources
Categories defined but never present in your data create unnecessary dummy variables, increase memory usage, and may confuse models. Removing them keeps your feature space clean.
Inconsistent Categories Break Pipelines
If training data has categories that test data lacks or vice versa, your encoding scheme fails. Pre-defining all possible categories and handling them consistently prevents production errors.
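A sketch of one way to enforce this: define a single `CategoricalDtype` as the contract shared by training and serving data, so both encode to the same column set regardless of which values actually appear:

```python
import pandas as pd

# One dtype, defined once, applied to every dataset
region_dtype = pd.CategoricalDtype(["North", "South", "East", "West"])

train = pd.Series(["North", "South"], dtype=region_dtype)
test = pd.Series(["West"], dtype=region_dtype)  # unseen in training, still valid

# Both encode to identical columns, so the model input shape never changes
print(pd.get_dummies(train).columns.equals(pd.get_dummies(test).columns))
```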
Best Practices for Categorical Data Management
- Convert to categorical early. As soon as you identify categorical columns, convert them. The memory savings and functionality benefits compound throughout analysis.
- Define all categories upfront. Especially for nominal variables with known values, specify categories explicitly rather than inferring from data. This prevents inconsistencies across datasets.
- Use ordered=True only when order matters. Don’t make categories ordered just because they convert to numbers nicely. Order should reflect real-world relationships.
- Remove unused categories before modeling. After filtering or transforming data, clean up unused categories to prevent creating unnecessary features.
- Document category mappings. Save the mapping between categories and codes. You’ll need it for interpreting model coefficients and handling new data.
- Standardize categories across datasets. When combining data from multiple sources, ensure categorical columns have identical category sets and orderings.
- Validate category consistency. Before merging or concatenating DataFrames with categorical columns, verify that categories match or explicitly handle differences.
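For the standardization points above, pandas ships a helper worth knowing: `union_categoricals` merges categoricals from multiple sources into one consistent category set. A minimal sketch:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two sources with partially overlapping category sets
a = pd.Categorical(["Bronze", "Silver"])
b = pd.Categorical(["Gold", "Silver"])

# The result carries the union of both category sets
combined = union_categoricals([a, b])
print(combined.categories.tolist())  # ['Bronze', 'Silver', 'Gold']
```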
Common Pitfalls to Avoid
- Forgetting to specify categories when concatenating. Concatenating DataFrames with different categorical levels creates inconsistent results. Standardize categories first.
- Using string operations on categorical columns. Many string methods don’t work on categorical dtype. Convert to string for manipulation, then back to categorical if needed.
- Assuming alphabetical order is logical order. Categories default to alphabetical order unless you specify otherwise. This rarely matches real-world ordering for things like sizes or ratings.
- Not removing unused categories after filtering. Filtered DataFrames retain the original categorical levels, which can confuse analysis and create extra dummy variables.
- Mixing ordered and unordered categoricals. Be consistent. If satisfaction is ordered in one dataset, make it ordered everywhere. Inconsistency causes comparison errors.
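The first pitfall above is easy to reproduce: concatenating categoricals with mismatched category sets silently falls back to object dtype, losing the memory and speed benefits. A minimal sketch of the failure and the fix:

```python
import pandas as pd

a = pd.Series(["Bronze", "Silver"], dtype="category")
b = pd.Series(["Gold"], dtype="category")

# Mismatched categories -> pandas downgrades to object
print(pd.concat([a, b]).dtype)  # object

# Fix: align both series to one shared dtype before concatenating
shared = pd.CategoricalDtype(["Bronze", "Silver", "Gold"])
print(pd.concat([a.astype(shared), b.astype(shared)]).dtype)  # category
```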
Final Thoughts
Categorical data management is foundational to building stable, unbiased models. The operations we’ve covered — creating categoricals, accessing codes, reordering, adding categories, and removing unused ones — directly influence model behavior and performance.
Every encoding choice carries consequences. Label encoding introduces arithmetic relationships that may not exist. One-hot encoding prevents bias but increases dimensionality. Ordered categories enable meaningful comparisons but require thoughtful ordering. Unused categories waste resources and may cause errors. Inconsistent categories break production pipelines.
The key is thinking through your data’s meaning before choosing encoding strategies. Are categories truly ordered or just nominal? Do you know all possible values or might new ones appear? Will you combine data from multiple sources? These questions guide your categorical data management approach.
Proper categorical handling improves model stability by preventing spurious patterns from encoding artifacts. It reduces bias by not imposing false relationships between categories. It prevents errors by ensuring consistency across datasets. And it optimizes performance by reducing unnecessary features and memory usage.
Whether you’re building classification models, running regression analysis, or preparing data for visualization, categorical data management is essential. Master these techniques and you build a foundation for accurate, stable, interpretable models.
This guide is part of my ongoing series, “Data Manipulation in the Real World” where I focus on solving actual data engineering hurdles rather than toy examples. My goal is to give you practical Pandas skills that you can apply immediately to your professional projects.
Found this guide valuable? Your engagement helps other practitioners discover these techniques. If this article helped you understand categorical data handling better, please follow my page, give it a clap, leave a comment with your questions or use cases, and share it with your network. I respond to every technical question.
What categorical encoding challenges have you faced? What strategies work best in your domain? Share your experiences in the comments below.
Part 14: Data Manipulation in Categorical Data Management was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.