library(tidyverse)
library(scales)
library(ggokabeito)Tutorial 9: Predictive Analytics for Workforce Planning
By the end of this tutorial, you should be able to:
- Explore how attendance varies with game time (day/night) and opponent using tidy plots.
- Build time-based train/test splits and create cell-means predictors on the training set.
- Evaluate forecasts on the test set using MAE, RMSE, and MAPE.
- Fit and compare regression models (Basic OLS, Quadratic in game number, Degree-8 polynomial) and relate fit to generalisation.
- Connect forecast accuracy to operational planning decisions.
The Business Challenge
Getting set up
First, we need to load our packages:
Next, we need to load our dataset.
bos_2005_2015_fenway <-
read_csv('data/mlb_attendance.csv')Prepare these Exercises before Class
Prepare these exercises before coming to class. Plan to spend 30 minutes on these exercises.
Exercise 1: Building predictors available in advance from the training set
In the lecture, the lag-1 model performed well because recent attendance is informative. But it is not always useful for workforce planning: concession managers often need to roster staff before the previous game has been played. In this tutorial we therefore focus on predictors that are known before the game:
- whether the game is during the day or evening
- the day of the week
- the opponent
Our goal is to use the training data to decide whether these forward-looking features contain useful predictive signal, then turn them into simple test-set forecasts.
Overfitting happens when a model learns patterns that are too specific to the data used to build it. In practice, this means the model is fitting idiosyncratic variation in the training data rather than meaningful structure. Idiosyncratic variation is not systematic, so we should not expect it to reoccur in future games in the same way. The model may therefore look accurate on familiar data, but perform poorly when asked to predict new games.
A training/test split helps us check for this. We build the model using the training set, then evaluate it on a separate test set that was not used to calculate the model. If the model works well on the test set, we have better evidence that it has learned a useful pattern rather than memorised noise.
Because this is a forecasting problem, we split by time: earlier seasons are used for training, and later seasons are held out for testing. This mirrors how managers would actually use the model: learn from the past, then forecast future games.
(a). Create a time-based training and test split. Use games from seasons before 2013 as the training set, and games from 2013 onward as the test set.
training_sample_2005_2012 <-
bos_2005_2015_fenway |>
YOUR_CODE_HERE
test_sample_2013_2015 <-
bos_2005_2015_fenway |>
YOUR_CODE_HERE Why is a time-based split more realistic than randomly splitting games for this forecasting problem?
(b). Start with the simplest forward-looking feature: whether the game is played during the day or at night. Complete and run the code below, then describe the size and direction of the difference in attendance across day vs night games.
training_sample_2005_2012 |>
YOUR_CODE_HERE(day_night) |>
YOUR_CODE_HERE(
ave_attendance = mean(attendance, na.rm = TRUE),
n_games = n(),
.groups = "drop"
) |>
ggplot(aes(x = day_night, y = ave_attendance, fill = day_night)) +
geom_col(width = 0.5) +
geom_text(
aes(label = round(ave_attendance, 0)),
vjust = -0.5, size = 4, color = "black"
) +
labs(x = "Game Time", y = "Average Attendance", fill = "Game Time") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
ggokabeito::scale_fill_okabe_ito() +
coord_cartesian(ylim = c(30000, 40000)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank())(c). Now inspect two richer versions of the same idea: attendance across the different day-of-week/game-time combinations, and attendance across the different opponents. Run the code below.
training_sample_2005_2012 |>
mutate(
dotw = factor(
dotw,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
)
) |>
group_by(dotw, day_night) |>
summarise(
ave_attendance = mean(attendance, na.rm = TRUE),
n_games = n(),
.groups = "drop"
) |>
ggplot(aes(x = dotw, y = ave_attendance, fill = day_night)) +
geom_col(position = position_dodge(width = 0.7), width = 0.6) +
ggokabeito::scale_fill_okabe_ito(name = "Game Time") +
labs(x = "Day of the Week", y = "Average Attendance") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
coord_cartesian(ylim = c(30000, 40000)) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)
)
training_sample_2005_2012 |>
group_by(opp) |>
summarise(
ave_attendance = mean(attendance, na.rm = TRUE),
n_games = n(),
.groups = "drop"
) |>
arrange(desc(ave_attendance)) |>
mutate(opp = factor(opp, levels = opp)) |>
ggplot(aes(x = opp, y = ave_attendance)) +
geom_col(fill = "steelblue", width = 0.5) +
labs(x = "Opponent", y = "Average Attendance") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
coord_cartesian(ylim = c(30000, 40000)) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text.x = element_text(size = 8, angle = 45, hjust = 1)
)Briefly answer the following:
- Which feature appears most useful for forecasting attendance?
- Which feature would be easiest to explain to a concessions manager?
- Which feature is most likely to be noisy because it creates small groups?
(d). Build three cell-means models from the training set, then join those training means to the test set as predictions.
A cell-means model predicts using the average outcome within a group, or “cell”, in the training data. For example, a day/night cell-means model calculates the average training-set attendance for day games and the average training-set attendance for night games. It then uses those averages as predictions for future day and night games.
This kind of model is simple and explainable: “games like this usually draw about this many fans.” The tradeoff is that more detailed (i.e., narrower) cells can become noisy.
Complete the missing pieces in the code below to fit these cell-means models.
average_attendance_by_daynight <-
training_sample_2005_2012 |>
group_by(YOUR_CODE_HERE) |>
summarise(
ave_attendance = mean(attendance, na.rm = TRUE),
.groups = "drop"
)
average_attendance_by_dotw_dn <-
training_sample_2005_2012 |>
group_by(YOUR_CODE_HERE, YOUR_CODE_HERE) |>
summarise(
ave_attendance = mean(attendance, na.rm = TRUE),
.groups = "drop"
)
average_attendance_by_opp <-
training_sample_2005_2012 |>
group_by(YOUR_CODE_HERE) |>
summarise(
ave_attendance = mean(attendance, na.rm = TRUE),
.groups = "drop"
)
test_sample_2013_2015 <-
test_sample_2013_2015 |>
left_join(average_attendance_by_daynight, by = YOUR_CODE_HERE) |>
rename(predicted_attendance_daynight = ave_attendance) |>
left_join(average_attendance_by_dotw_dn, by = c(YOUR_CODE_HERE, YOUR_CODE_HERE)) |>
rename(predicted_attendance_dotw_dn = ave_attendance) |>
left_join(average_attendance_by_opp, by = YOUR_CODE_HERE) |>
rename(predicted_attendance_opp = ave_attendance)(e). Each model uses information available before the game. Before calculating formal error metrics, which one do you expect to perform best on the test set, and why? In your answer, distinguish between a feature that looks useful in the training data and a model that is likely to generalise well to future games.
Exercise 2: Evaluating predictors available in advance in the test set
The lecture introduced MAE, RMSE, and MAPE. Rather than rebuild the same error pipeline three separate times, we will use one tidy workflow to compare all three models with predictors available in advance from Exercise 1.
(a). Run the code below to put the three prediction columns into a long format. Before doing so complete the missing lines that calculate forecast errors.
simple_model_errors <-
test_sample_2013_2015 |>
select(
attendance,
predicted_attendance_daynight,
predicted_attendance_dotw_dn,
predicted_attendance_opp
) |>
pivot_longer(
cols = starts_with("predicted_attendance_"),
names_to = "model",
values_to = "predicted_attendance"
) |>
mutate(
model = recode(
model,
predicted_attendance_daynight = "Day/Night",
predicted_attendance_dotw_dn = "Day of week x Day/Night",
predicted_attendance_opp = "Opponent"
),
error = YOUR_CODE_HERE - YOUR_CODE_HERE,
abs_error = abs(YOUR_CODE_HERE),
squared_error = YOUR_CODE_HERE ^ 2,
percentage_error = (YOUR_CODE_HERE / YOUR_CODE_HERE) * 100
)What does pivot_longer() do here, and why is it helpful when we want to compare several models?
(b). Complete and run the code below to calculate MAE, RMSE, and MAPE for each model.
simple_model_metrics <-
simple_model_errors |>
group_by(model) |>
summarise(
MAE = mean(YOUR_CODE_HERE, na.rm = TRUE),
RMSE = sqrt(mean(YOUR_CODE_HERE, na.rm = TRUE)),
MAPE = mean(abs(YOUR_CODE_HERE), na.rm = TRUE),
.groups = "drop"
)Which metric is most directly useful for setting a staffing buffer? Which metric is most sensitive to occasional large misses?
(c). Run the code below to plot the metrics.
simple_metrics_long <-
simple_model_metrics |>
pivot_longer(
cols = c(MAE, RMSE, MAPE),
names_to = "metric",
values_to = "value"
)
ggplot(simple_metrics_long, aes(x = model, y = value, fill = model)) +
geom_col() +
geom_text(aes(label = round(value, 0)), vjust = -0.5, size = 3) +
facet_wrap(~ metric, scales = "free_y") +
ggokabeito::scale_fill_okabe_ito() +
labs(
title = "Forecast Accuracy Metrics by Model",
x = "Model",
y = "Value"
) +
theme_minimal() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1)
)(d). Based on the table and plot, choose one model for a concessions staffing manager. In your answer, consider:
- test-set accuracy
- whether the predictor is available early enough for rostering
- whether the model is simple enough to explain
- whether the accuracy differences are large enough to matter operationally
In-Class Exercises
You will discuss these exercises in class with your peers in small groups and with your tutor. These exercises build from the exercises you have prepared above; you will get the most value from the class if you have completed those above before coming to class.
Exercise 3: Fitting regression models
Exercises 1 and 2 used simple cell-means models: each forecast came from the average attendance for one group of games. We now turn to ordinary least squares (OLS) regression, which, among other things, lets us combine multiple predictors in a single model.
The aim is not to find the most complicated model possible. The aim is to ask whether a richer model improves the forecast enough to justify the extra complexity for staffing decisions.
Ordinary least squares (OLS) regression is a way to model the relationship between an outcome we want to predict and one or more predictors. Here, the outcome is attendance, and the predictors might include opponent, game time, day of week, or game number.
OLS works by choosing the line, curve, or set of group differences that makes the model’s prediction errors as small as possible in the training data. More precisely, it chooses the coefficients that minimise the sum of squared errors: the squared differences between the actual attendance values and the model’s fitted values.
For example, a simple regression might estimate how attendance changes between day and night games. A richer regression can combine several predictors at once, such as opponent, day/night, and day of week. When a predictor is categorical, such as opponent, OLS compares each category to a reference category.
In this tutorial, we use OLS as a forecasting tool. We fit the model on earlier seasons, use it to predict attendance in later seasons, and then evaluate whether those predictions are accurate enough and explainable enough to support staffing decisions.
(a). Fit an OLS model that uses the same forward-available predictors we studied earlier: opponent, day/night, and day of week. Complete and run the code below.
basic_ols_model <-
lm(YOUR_CODE_HERE ~ YOUR_CODE_HERE + YOUR_CODE_HERE + YOUR_CODE_HERE,
data = YOUR_CODE_HERE
)(b). Run the code below to view the model summary.
summary(basic_ols_model)Discuss with your group:
- Which predictors appear to have the largest associations with attendance?
- What does the R-squared value tell us about in-sample fit?
- Why should we avoid choosing the model on R-squared alone?
(c). The basic OLS model combines categorical predictors. We can also ask whether attendance changes systematically over the course of the season. Run the two plots below to compare a simple quadratic curve with a more flexible degree-8 curve.
ggplot(training_sample_2005_2012, aes(x = gm, y = attendance)) +
geom_point(alpha = 0.6) +
stat_smooth(
method = "lm",
formula = y ~ x + I(x^2),
color = "blue",
se = FALSE
) +
labs(
title = "Quadratic Fit of Attendance by Game Number",
x = "Game Number",
y = "Attendance"
) +
theme_minimal()
ggplot(training_sample_2005_2012, aes(x = gm, y = attendance)) +
geom_point(alpha = 0.6) +
stat_smooth(
method = "lm",
formula = y ~ x + I(x ^ 2) + I(x ^ 3) + I(x ^ 4) +
I(x ^ 5) + I(x ^ 6) + I(x ^ 7) + I(x ^ 8),
color = "blue",
se = FALSE
) +
labs(
title = "Degree-8 Polynomial of Attendance by Game Number",
x = "Game Number",
y = "Attendance"
) +
theme_minimal()Which curve looks more likely to generalise to future games, and why?
(d). Let’s formally model the plots above. Run the code below to fit two additional OLS specifications: a simple quadratic model using game number, and a more flexible specification that combines opponent, day/night, day of week, and a degree-8 polynomial in game number.
non_linear_ols_model <-
lm(attendance ~ gm + I(gm^2),
data = training_sample_2005_2012
)
eighth_poly_ols_model <-
lm(attendance ~ opp + day_night + dotw + gm + I(gm^2) +
I(gm^3) + I(gm^4) + I(gm^5) + I(gm^6) + I(gm^7) + I(gm^8),
data = training_sample_2005_2012
)(e). Complete and run the code below to compare in-sample R-squared across the three OLS specifications.
ols_fit_summary <- tibble(
model = c("Basic OLS", "Quadratic game number", "Degree-8 polynomial OLS"),
r_squared = c(
summary(YOUR_CODE_HERE)$r.squared,
summary(YOUR_CODE_HERE)$r.squared,
summary(YOUR_CODE_HERE)$r.squared
)
)
ols_fit_summaryWhich model fits the training data best? Why is this not enough to choose the forecasting model? In your answer, consider fit, complexity, and operational usefulness.
Exercise 4: Evaluating regression models
Now that we have fitted our OLS models, we need to test whether the extra complexity actually improves forecasts for future games. This mirrors Exercise 2: train on earlier seasons, evaluate on later seasons, and choose based on usefulness for staffing rather than in-sample fit alone.
(a). Complete and run the setup code below. It creates a test sample that contains only opponent levels observed in the training data, then generates test-set predictions from the three OLS models.
known_opp_levels <-
training_sample_2005_2012 |>
pull(opp) |>
unique()
safe_test_sample <-
test_sample_2013_2015 |>
filter(opp %in% known_opp_levels)
safe_test_sample <-
safe_test_sample |>
mutate(
predicted_attendance_basic_OLS =
predict(YOUR_CODE_HERE, newdata = safe_test_sample),
predicted_attendance_non_linear_OLS =
predict(YOUR_CODE_HERE, newdata = safe_test_sample),
predicted_attendance_eighth_poly_OLS =
predict(YOUR_CODE_HERE, newdata = safe_test_sample)
)Why do we need to handle opponent levels carefully before using predict() on the test set? What would happen if we didn’t first trim our test sample in this way?
(b). Run the code below, then inspect the resulting ols_model_errors data frame.
ols_model_errors <-
safe_test_sample |>
select(
attendance,
predicted_attendance_basic_OLS,
predicted_attendance_non_linear_OLS,
predicted_attendance_eighth_poly_OLS
) |>
pivot_longer(
cols = starts_with("predicted_attendance_"),
names_to = "model",
values_to = "predicted_attendance"
) |>
mutate(
model = recode(
model,
predicted_attendance_basic_OLS = "Basic OLS",
predicted_attendance_non_linear_OLS = "Quadratic game number",
predicted_attendance_eighth_poly_OLS = "Degree-8 polynomial OLS"
),
error = attendance - predicted_attendance,
abs_error = abs(error),
squared_error = error ^ 2,
percentage_error = (error / attendance) * 100
)What does this code do? Why is each row now a model-game combination rather than just a game? Why do we create abs_error, squared_error, and percentage_error?
(c). Now use ols_model_errors to create ols_model_metrics, a table like the one below. Your output should show the values for each cell, e.g., each model’s MAE, RMSE, etc.
| model | MAE | RMSE | MAPE |
|---|---|---|---|
| Quadratic game number | |||
| Degree-8 polynomial OLS | |||
| Basic OLS |
Write the code to produce this table:
# Write your answer here(d). Interpret the table from part (c).
Which OLS model performs best out of sample? Why doesn’t this match the ranking by training-set R-squared from Exercise 3?
Exercise 5: Choosing and using a staffing forecast
The final decision is not just “which OLS model won?” We also have the simple cell-means models from Exercise 2. Before comparing them, we need to check that the models were evaluated on the same games.
(a). In Exercise 2, the simple cell-means models were evaluated on test_sample_2013_2015. In Exercise 4, the OLS models were evaluated on safe_test_sample.
Run the code below to compare the two samples.
tibble(
sample = c("Full test sample", "OLS-safe test sample"),
n_games = c(
nrow(test_sample_2013_2015),
nrow(safe_test_sample)
)
)
test_sample_2013_2015 |>
filter(!(opp %in% known_opp_levels)) |>
count(opp, sort = TRUE)What is the sample mismatch?
(b). The table below fixes the sample mismatch from part (a). It compares the best simple cell-means model with the best OLS model, using only the same OLS-safe test rows.
| Model family | Best model in family | MAE (fans) | RMSE (fans) | MAPE (%) |
|---|---|---|---|---|
| Cell-means | Day/Night | 1691.2 | 2412.5 | 5.0 |
| OLS | Quadratic game number | 1673.9 | 2396.4 | 5.0 |
What does this table compare? Why is this comparison fair, while a direct comparison between simple_model_metrics from Exercise 2 and ols_model_metrics from Exercise 4 was not fair?
(c). Interpret the table in part (b).
Which model has the lower MAE on the same test rows? Is the difference large enough to matter for a concessions staffing decision? Would RMSE or MAPE change your conclusion?
(d). Recommend a model and explain how the manager should use it.
In your answer, include:
- which model you choose
- the evidence from the same-sample metrics
- why the extra complexity is or is not worth it
- how the forecast becomes a baseline roster
- how MAE would be used as a staffing buffer and monitored over time
(e). A good working model still needs maintenance. Explain when and why the manager should update or replace the model.
In your answer, consider:
- what the manager should monitor over time
- what changes might make the model stale
- how a major change in the environment could affect forecast accuracy