Assignment 3: Machine Learning Exercise

Course: EPPS 6323 – Knowledge Mining

Dataset: TEDS 2016 Survey Data


1. Review of ISLR Chapter 3 (Linear Regression)

Chapter 3 of Introduction to Statistical Learning (ISLR) focuses on linear regression, one of the most fundamental methods in statistical learning. Linear regression models the relationship between a dependent variable and one or more predictor variables by fitting a line that minimizes the difference between observed and predicted values.

A simple linear regression model can be written as:

Y = β0 + β1X + ε

where Y is the response variable, X is the predictor, β0 is the intercept, β1 is the slope coefficient, and ε is a random error term with mean zero.

Chapter 3 explains several important concepts:

  1. Estimating the coefficients β0 and β1 by least squares
  2. Assessing the accuracy of the estimates with standard errors, confidence intervals, and hypothesis tests (t-statistics and p-values)
  3. Assessing overall model fit with the residual standard error and the R² statistic
  4. Extending the model to multiple linear regression, including qualitative predictors, interaction terms, and potential problems such as collinearity and outliers

The chapter also discusses the limitations of linear regression, particularly when the response variable is categorical rather than continuous, which requires classification models such as logistic regression.
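To make the estimation step concrete, here is a minimal sketch that fits a simple linear regression to simulated data (the variables and the true coefficients, 2 and 3, are made up for illustration and are not taken from the TEDS survey):

```r
# Simulate data from a known linear model: Y = 2 + 3X + noise
set.seed(42)
x <- runif(100, 0, 10)        # predictor
y <- 2 + 3 * x + rnorm(100)   # response with Gaussian noise

fit <- lm(y ~ x)              # least-squares fit of Y = b0 + b1*X
coef(fit)                     # estimates should be close to 2 and 3
summary(fit)$r.squared        # proportion of variance explained
```

Because the noise is small relative to the signal, the estimated coefficients land near the true values and R² is high.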

Reference used for review:
altaf-ali.github.io/ISLR/chapter3/solutions.html


2. Load the Dataset

The TEDS 2016 dataset was loaded using the haven package and the read_stata() function.

library(haven)

TEDS_2016 <- read_stata("https://raw.githubusercontent.com/datageneration/home/master/DataProgramming/data/TEDS_2016.dta")

The dataset contains survey responses related to voting behavior, demographics, and political attitudes.


3. Create a Regression Plot Function

A custom function called regplot was created to visualize the relationship between two variables using a regression line.

# Fit y on x, plot the points, and overlay the fitted regression line
regplot <- function(x, y) {
  fit <- lm(y ~ x)          # simple linear regression
  plot(x, y)                # scatterplot of the raw data
  abline(fit, col = "red")  # fitted line in red
}

This function:

  1. Fits a simple linear regression of y on x using lm()
  2. Draws a scatterplot of the two variables with plot()
  3. Overlays the fitted regression line in red with abline()


4. Run Regression Plots

The regplot function was applied with vote choice (votetsai) as the dependent variable and three independent variables: age, education, and income.

a. Age

regplot(TEDS_2016$age, TEDS_2016$votetsai)

This plot examines the relationship between age and vote choice.

b. Education

regplot(TEDS_2016$edu, TEDS_2016$votetsai)

This plot evaluates how education level relates to voting behavior.

c. Income

regplot(TEDS_2016$income, TEDS_2016$votetsai)

This plot explores the relationship between income level and vote choice.


5. What is the Problem?

The main problem is that the dependent variable (vote choice) is categorical, not continuous.

In this dataset, the dependent variable votetsai contains two categories:

  1. votetsai = 1: the respondent voted for Tsai Ing-wen
  2. votetsai = 0: the respondent voted for another candidate

Because linear regression assumes a continuous dependent variable, applying a standard regression line is not ideal for predicting a binary outcome.

The scatterplots also show that the dependent variable only takes two values (0 and 1), which limits the usefulness of a linear regression line.


6. Why Does This Problem Occur?

Linear regression models assume:

  1. A continuous response variable that can, in principle, take any value
  2. A linear relationship between the predictors and the response
  3. Errors that are independent and have constant variance

However, when the dependent variable is binary, the predicted values from linear regression can fall outside the valid range (less than 0 or greater than 1). This makes the model unsuitable for classification problems.
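This can be demonstrated with a small simulated sketch (hypothetical data, not the TEDS survey): a linear model fit to a binary outcome produces predictions that escape the [0, 1] interval at extreme predictor values, while logistic regression keeps every predicted probability strictly between 0 and 1:

```r
# Simulate a binary outcome whose probability depends on x
set.seed(6323)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))     # y is 0 or 1

lin  <- lm(y ~ x)                      # linear probability model
logi <- glm(y ~ x, family = binomial)  # logistic regression

grid <- data.frame(x = seq(-4, 4, by = 0.1))
range(predict(lin, newdata = grid))                      # typically < 0 and > 1 at the extremes
range(predict(logi, newdata = grid, type = "response"))  # always within (0, 1)
```

The linear fit is a straight line, so extrapolating far enough in either direction necessarily yields impossible "probabilities"; the logistic curve flattens toward 0 and 1 instead.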


7. How Can Prediction Be Improved?

Prediction of the dependent variable can be improved by using classification models instead of linear regression.

Better alternatives include:

  1. Logistic Regression
  2. Decision Trees
  3. Random Forest
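As a sketch of how a classification model replaces the regression line, the following simulated example (hypothetical data, not the TEDS survey) fits a logistic regression, converts predicted probabilities to class labels at a 0.5 cutoff, and computes out-of-sample accuracy; the full analysis of the TEDS data appears in the R Code section below:

```r
# Simulate a binary outcome and split into training and test sets
set.seed(1)
df <- data.frame(x = rnorm(300))
df$y <- rbinom(300, 1, plogis(1.5 * df$x))

train <- df[1:200, ]
test  <- df[201:300, ]

model <- glm(y ~ x, data = train, family = binomial)

probs <- predict(model, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)   # classify at the 0.5 threshold
accuracy <- mean(pred == test$y)     # share of correct predictions
accuracy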

In this analysis, three models were tested and compared. The results show that Random Forest achieved the highest prediction accuracy, followed by logistic regression and decision trees.

Figure 1: Comparison of prediction accuracy across machine learning models

As the figure shows, Random Forest provides the highest accuracy.

The results indicate that ensemble machine learning methods such as Random Forest can provide better predictive performance than simple linear models when dealing with categorical outcomes.


Conclusion

This exercise demonstrates that although linear regression can be used to visualize relationships, it is not suitable for predicting a binary dependent variable. Because the outcome variable in this dataset has only two categories, classification models provide better predictive performance. Logistic regression, decision trees, and random forests were applied, and the results show that the Random Forest model achieved the highest prediction accuracy. This suggests that ensemble machine learning methods can improve prediction performance when modeling categorical outcomes.


R Code

# ============================================================================
# Workshop: Machine Learning with Survey Data (TEDS 2016)
# EPPS 6323 Knowledge Mining
# Karl Ho, University of Texas at Dallas
# ============================================================================

# ----------------------------------------------------------------------------
# 0. Setup: Install (if needed) and Load Packages
# ----------------------------------------------------------------------------

required_pkgs <- c("haven","tidyverse","GGally","cluster",
                   "rpart","rpart.plot","randomForest","caret","e1071")

installed <- rownames(installed.packages())

for(p in required_pkgs){
  if(!(p %in% installed)) install.packages(p)
}

library(haven)
library(tidyverse)
library(GGally)
library(cluster)
library(rpart)
library(rpart.plot)
library(randomForest)
library(caret)
library(e1071)

# ----------------------------------------------------------------------------
# Load Data
# ----------------------------------------------------------------------------

TEDS_2016 <- read_stata(
"https://raw.githubusercontent.com/datageneration/home/master/DataProgramming/data/TEDS_2016.dta"
)

# ============================================================================
# PART I: EXPLORATORY DATA ANALYSIS
# ============================================================================

# Data overview
dim(TEDS_2016)
names(TEDS_2016)
summary(TEDS_2016)

# Missing values
TEDS_2016 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to="variable", values_to="n_missing") %>%
  filter(n_missing > 0) %>%
  arrange(desc(n_missing))

# Missing value plot
TEDS_2016 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to="variable", values_to="n_missing") %>%
  filter(n_missing > 0) %>%
  ggplot(aes(x=reorder(variable,n_missing), y=n_missing)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(title="Missing Values by Variable", x=NULL, y="Count") +
  theme_minimal()

# ----------------------------------------------------------------------------
# Recode variables
# ----------------------------------------------------------------------------

teds <- TEDS_2016 %>%
  mutate(
    vote = factor(votetsai, levels=c(0,1), labels=c("Other","Tsai")),
    gender = factor(female, levels=c(0,1), labels=c("Male","Female")),
    Tondu = as.factor(Tondu),
    Party = as.factor(Party)
  ) %>%
  select(vote, gender, age, edu, income,
         Taiwanese, Econ_worse, Tondu, Party, DPP) %>%
  drop_na()

glimpse(teds)

# ----------------------------------------------------------------------------
# Visualization
# ----------------------------------------------------------------------------

# Vote distribution
ggplot(teds, aes(x=vote, fill=vote)) +
  geom_bar() +
  theme_minimal()

# Age distribution
ggplot(teds, aes(x=age)) +
  geom_histogram(binwidth=5, fill="steelblue", color="white") +
  theme_minimal()

# Vote by gender
ggplot(teds, aes(x=gender, fill=vote)) +
  geom_bar(position="fill") +
  theme_minimal()

# Age by vote
ggplot(teds, aes(x=vote, y=age, fill=vote)) +
  geom_boxplot() +
  theme_minimal()

# ----------------------------------------------------------------------------
# Correlation matrix
# ----------------------------------------------------------------------------

teds %>%
  select(age,edu,income,Taiwanese,Econ_worse,DPP) %>%
  cor(use="complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1,names_to="var2",values_to="cor") %>%
  ggplot(aes(var1,var2,fill=cor)) +
  geom_tile() +
  geom_text(aes(label=round(cor,2))) +
  scale_fill_gradient2(low="red",mid="white",high="blue") +
  theme_minimal()

# ----------------------------------------------------------------------------
# Pairs plot
# ----------------------------------------------------------------------------

teds %>%
  select(age,edu,income,Taiwanese,Econ_worse,vote) %>%
  ggpairs(aes(color=vote, alpha=0.5))

# ============================================================================
# PART II: UNSUPERVISED LEARNING (K-means)
# ============================================================================

teds_numeric <- teds %>%
  select(age,edu,income,Taiwanese,Econ_worse,DPP) %>%
  scale() %>%
  as.data.frame()

# Elbow method
wss <- numeric(10)

for(i in 1:10){
  wss[i] <- kmeans(teds_numeric, centers=i, nstart=10)$tot.withinss
}

plot(1:10, wss, type="b", pch=19,
     xlab="Number of Clusters (k)",
     ylab="Within-cluster Sum of Squares",
     main="Elbow Method")

# K-means clustering
set.seed(6323)
km_result <- kmeans(teds_numeric, centers=3, nstart=25)

table(km_result$cluster)

# Cluster profiling
teds$cluster <- as.factor(km_result$cluster)

teds %>%
  group_by(cluster) %>%
  summarise(
    n=n(),
    mean_age=mean(age),
    mean_edu=mean(edu),
    mean_income=mean(income)
  )

# ============================================================================
# PART III: SUPERVISED LEARNING
# ============================================================================

set.seed(6323)

train_index <- createDataPartition(teds$vote, p=0.7, list=FALSE)

train_data <- teds[train_index,] %>% select(-cluster)
test_data  <- teds[-train_index,] %>% select(-cluster)

# ----------------------------------------------------------------------------
# Logistic Regression
# ----------------------------------------------------------------------------

logit_model <- glm(vote ~ age + gender + edu + income +
                     Taiwanese + Econ_worse + DPP,
                   data=train_data,
                   family=binomial)

summary(logit_model)

logit_probs <- predict(logit_model, newdata=test_data, type="response")

logit_pred <- ifelse(logit_probs > 0.5, "Tsai", "Other")
logit_pred <- factor(logit_pred, levels=c("Other","Tsai"))

confusionMatrix(logit_pred, test_data$vote)

# ----------------------------------------------------------------------------
# Decision Tree
# ----------------------------------------------------------------------------

tree_model <- rpart(vote ~ age + gender + edu + income +
                      Taiwanese + Econ_worse + DPP,
                    data=train_data,
                    method="class")

rpart.plot(tree_model)

tree_pred <- predict(tree_model, newdata=test_data, type="class")

confusionMatrix(tree_pred, test_data$vote)

# ----------------------------------------------------------------------------
# Random Forest
# ----------------------------------------------------------------------------

set.seed(6323)

rf_model <- randomForest(vote ~ age + gender + edu + income +
                           Taiwanese + Econ_worse + DPP,
                         data=train_data,
                         ntree=500,
                         importance=TRUE)

rf_pred <- predict(rf_model, newdata=test_data)

confusionMatrix(rf_pred, test_data$vote)

# ----------------------------------------------------------------------------
# Variable Importance
# ----------------------------------------------------------------------------

importance_df <- as.data.frame(importance(rf_model)) %>%
  rownames_to_column("variable")

ggplot(importance_df,
       aes(x=reorder(variable, MeanDecreaseGini),
           y=MeanDecreaseGini)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  theme_minimal()

# ----------------------------------------------------------------------------
# Model Comparison
# ----------------------------------------------------------------------------

results <- tibble(
  Model=c("Logistic Regression","Decision Tree","Random Forest"),
  Accuracy=c(
    confusionMatrix(logit_pred, test_data$vote)$overall["Accuracy"],
    confusionMatrix(tree_pred, test_data$vote)$overall["Accuracy"],
    confusionMatrix(rf_pred, test_data$vote)$overall["Accuracy"]
  )
)

print(results)

ggplot(results,
       aes(x=reorder(Model,Accuracy),
           y=Accuracy,
           fill=Model)) +
  geom_col() +
  coord_flip() +
  labs(title="Model Comparison", x=NULL, y="Accuracy") +
  theme_minimal()