Building a Sex Classifier from Workclass Categorical Features
=============================================================

In this tutorial, we’ll build a sex classifier from one-hot-encoded workclass features using logistic regression, compare it against a model trained on all available features, and then experiment with ensemble methods. Along the way we’ll cover encoding the categorical variables and selecting the relevant columns for classification.

Problem Statement


The dataset contains information about individuals, including their age, workclass, and other demographic details (the column names match the UCI Adult census dataset). The task is to build a classifier that predicts an individual’s sex from the workclass feature alone. Because workclass is categorical, we will first expand it into multiple binary columns through one-hot encoding.
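
To make this concrete, here is a minimal sketch of what one-hot encoding does to a toy workclass column (the category names are made up for illustration):

import pandas as pd

# A tiny frame with one categorical column
toy = pd.DataFrame({'workclass': ['Private', 'Self-emp', 'Private', 'Federal-gov']})

# get_dummies replaces the column with one 0/1 (or boolean) indicator per category
print(pd.get_dummies(toy, columns=['workclass'], prefix=['workclass']))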

Step 1: Data Preparation


First, let’s prepare our dataset by removing rows with missing values and encoding the categorical variables.

import pandas as pd
import numpy as np

# Load the dataset. If your file marks missing values with '?'
# (as the UCI Adult census data does), tell pandas so dropna() can see them.
df = pd.read_csv('your_data.csv', na_values='?')

# Remove rows with missing values
df = df.dropna()

# Encode the target: 0 for Female, 1 for Male
df['Value'] = np.where(df['sex'] == 'Female', 0, 1)

# Categorical columns to one-hot encode. 'sex' is deliberately left out:
# its dummy columns would leak the target into the features.
cat_cols = ['workclass', 'marital-status', 'occupation', 'relationship',
            'race', 'native-country']

# One-hot encode the categorical columns and drop the raw 'sex' column
data = pd.get_dummies(df.drop(columns=['sex']), columns=cat_cols, prefix=cat_cols)

Step 2: Selecting Relevant Columns for Classification


Now that the dataset is encoded, let’s select the workclass indicator columns that will serve as our features.

# Select the one-hot-encoded workclass columns
workclass_cols = [col for col in data.columns if col.startswith('workclass')]

# Feature matrix (workclass indicators only) and target variable
X_workclass = data[workclass_cols]
y_workclass = data['Value']

# Print the shape of X_workclass and y_workclass
print(f"Shape of X_workclass: {X_workclass.shape}")
print(f"Shape of y_workclass: {y_workclass.shape}")

Step 3: Building the Logistic Regression Model for Sex Classification Based on Workclass Features


Next, let’s build a logistic regression model to classify sex based on the selected workclass features.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Split the data into training and testing sets
X_train_workclass, X_test_workclass, y_train_workclass, y_test_workclass = train_test_split(
    X_workclass, y_workclass, test_size=0.2, random_state=42)

# Create a logistic regression model
model_workclass = LogisticRegression()

# Train the model on the training data
model_workclass.fit(X_train_workclass, y_train_workclass)

# Make predictions on the testing data
y_pred_workclass = model_workclass.predict(X_test_workclass)

# Evaluate the model's performance
print(f"Accuracy: {accuracy_score(y_test_workclass, y_pred_workclass)}")
print(f"Classification Report:\n{classification_report(y_test_workclass, y_pred_workclass)}")

Step 4: Building the Logistic Regression Model for Sex Classification Based on All Features


Now that we have a model trained on the workclass features alone, let’s build another model that uses all encoded feature columns (excluding the target and any string columns we did not encode).

# Keep every numeric/indicator column except the target. Any string column we
# did not one-hot encode (e.g. 'education' in the Adult census data) is
# excluded here so the model can be fitted.
X_all_features = data.select_dtypes(exclude='object').drop(columns=['Value'])
y_all_features = data['Value']

# Split the data into training and testing sets
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(
    X_all_features, y_all_features, test_size=0.2, random_state=42)

# Create a logistic regression model; with this many columns the default
# solver may need extra iterations to converge
model_all_features = LogisticRegression(max_iter=1000)

# Train the model on the training data
model_all_features.fit(X_train_all, y_train_all)

# Make predictions on the testing data
y_pred_all_features = model_all_features.predict(X_test_all)

# Evaluate the model's performance
print(f"Accuracy: {accuracy_score(y_test_all, y_pred_all_features)}")
print(f"Classification Report:\n{classification_report(y_test_all, y_pred_all_features)}")

Step 5: Applying Ensemble Methods for Improved Performance


To try to improve performance, let’s apply two ensemble techniques: bagging, which averages many logistic regression models fitted on bootstrap samples of the training data, and gradient boosting, which builds an additive ensemble of decision trees. Note that scikit-learn’s GradientBoostingClassifier always boosts trees; it cannot use logistic regression as its base estimator.

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Bagging: average 100 logistic regression models fitted on bootstrap samples
# of the workclass features ('estimator' replaced 'base_estimator' in
# scikit-learn 1.2)
bagging_classifier = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                                       n_estimators=100)

# Train the bagging classifier on the training data
bagging_classifier.fit(X_train_workclass, y_train_workclass)

# Make predictions on the testing data
y_pred_bagging = bagging_classifier.predict(X_test_workclass)

# Gradient boosting: an additive ensemble of decision trees, trained here on
# the full feature set (it does not take a custom base estimator)
gb_classifier = GradientBoostingClassifier(n_estimators=100)

# Train the gradient boosting classifier on the training data
gb_classifier.fit(X_train_all, y_train_all)

# Make predictions on the testing data
y_pred_gb = gb_classifier.predict(X_test_all)

# Evaluate both models; note they use different feature sets, so this compares
# whole pipelines rather than ensemble methods alone
print(f"Bagging Accuracy: {accuracy_score(y_test_workclass, y_pred_bagging)}")
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test_all, y_pred_gb)}")

Ensembling can improve the accuracy and robustness of the classifier, but it is not guaranteed to; compare the accuracy scores from each step to see how much the additional features and the ensemble methods each actually contribute.


Last modified on 2024-02-11