library(tidyverse) # data manipulation and plotting
library(janitor) # cleaning column names
library(recipes) # For data preprocessing
library(cluster) # clustering
library(factoextra) # k-means visualization and cluster evaluation
library(broom) # tidy output of clustering results
Tutorial 7: Descriptive Analytics in HRM
By the end of this tutorial, you should be able to:
- Explain why employee attrition matters as a business analytics problem and why segmentation can help an analytics team explore different attrition profiles.
- Construct a simple rule-of-thumb employee segmentation using fixed thresholds and interpret attrition patterns across the resulting groups.
- Explain what K-means clustering is trying to achieve and compare its strengths and limitations with a hand-built rule-of-thumb segmentation.
- Prepare numeric variables for K-means clustering by selecting, scaling, and checking a preprocessing workflow in R.
- Implement and interpret K-means clustering models using both a simple two-variable setup and a richer five-variable employee profile.
- Evaluate a defensible choice of the number of clusters using the elbow method and justify that choice in words.
- Communicate segmentation results in a short analytics team memo, including the main patterns, implications, and limitations of the analysis.
The Business Challenge
The Topic: Why are employees leaving, and can segmentation help us understand different attrition profiles?
Every time an employee leaves, companies incur real costs through recruitment, training, disruption to teams, and lost experience. For an HR analytics team, an important first step is to understand whether different types of employees show different patterns of attrition.
In this tutorial, we use segmentation to explore that question. We begin with a simple rule-of-thumb grouping based on a small number of variables, and then compare it to K-means clustering, a more data-driven way of grouping employees with similar characteristics. This helps us move beyond one-off cases and identify broader employee profiles, which is a common task in descriptive analytics.
Our goal is not to build a formal prediction model or identify the causes of attrition. Instead, the goal is to explore whether different segmentation approaches reveal useful patterns that an analytics team could investigate further.
The Data: HR Analytics Data
We will work with an anonymised dataset from a company’s HR department. It contains information on 1,470 employees, including:
- Personal characteristics: age, gender, marital_status, distance_from_home
- Job-related features: job_satisfaction, monthly_income, years_at_company, department
- Outcome: attrition (whether the employee has left the company)
This gives us a useful set of variables for exploring whether different employee groups differ in meaningful ways.
The Method: Rule-of-Thumb and K-Means Segmentation
Our goal is to segment employees into groups with similar profiles and examine whether attrition differs across those groups. To do this, we will:
- create a simple rule-of-thumb segmentation using two variables
- compare it to K-means clustering on those same variables
- build a richer K-means model using five scaled numeric variables
- use the elbow method to help choose a reasonable number of clusters
- interpret the resulting clusters using their average characteristics and attrition rates
This allows us to compare a simple hand-built segmentation to a richer data-driven one, and to ask whether the more flexible approach produces more useful insight.
Loading the R packages and data
The packages we need to complete the exercises are the ones loaded with library() at the start of this tutorial.
Let’s load the data that we will use today:
df <-
  read_csv("data/hr_attrition.csv") |>
  # tidy up column names to snake_case
  clean_names()
Prepare these Exercises before Class
Prepare these exercises before coming to class. Plan to spend 30 minutes on these exercises.
Exercise 1: The What and Why of employee attrition
Employee attrition refers to people leaving the company — either voluntarily (resignation) or involuntarily (termination). While some turnover is normal, high attrition can signal deeper issues.
(a) What kinds of consequences might high attrition have for a business?
(b) What might cause employees to leave?
(c) Why might an analytics team want to group employees into segments when studying attrition, rather than looking only at employees one by one?
Exercise 2: A Simple Rule-of-Thumb Employee Segmentation
Suppose your analytics team wants to create a quick first-pass classification of employees using two variables:
- job_satisfaction
- distance_from_home
For this exercise, define:
- lower satisfaction as a score below 2
- longer commute as living 10 km or more from work
This gives four possible employee groups:
- Higher satisfaction, shorter commute
- Higher satisfaction, longer commute
- Lower satisfaction, shorter commute
- Lower satisfaction, longer commute
(a) Why might an analytics team begin with a simple rule-of-thumb segmentation before trying a more data-driven method?
(b) Complete the code below to classify each employee into one of the four segments.
df_rule <-
  df |>
  mutate(
    rule_segment = case_when(
      YOUR_CONDITION ~ "Higher satisfaction, shorter commute",
      YOUR_CONDITION ~ "Higher satisfaction, longer commute",
      YOUR_CONDITION ~ "Lower satisfaction, shorter commute",
      .default = "Lower satisfaction, longer commute"
    )
  )
(c) For each segment, calculate:
- the number of employees
- the attrition rate
# Write your answer here
(d) Which segment or segments appear to have the highest attrition?
(e) What are the advantages of this rule-of-thumb approach for an analytics team? What kinds of patterns might it fail to capture?
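To make the mechanics of a rule-of-thumb segmentation concrete, here is a minimal sketch on made-up data. The tibble toy, its columns rating, km and left, and the cutoffs used are all hypothetical and chosen only for illustration; they are not the tutorial dataset or the thresholds from this exercise. The sketch assigns groups with fixed cutoffs and then computes a count and a leaving rate per group.

```r
library(dplyr)

# Hypothetical toy data: six made-up employees (not the tutorial dataset)
toy <- tibble(
  rating = c(1, 4, 2, 3, 1, 4),
  km     = c(3, 25, 12, 2, 30, 8),
  left   = c("Yes", "No", "No", "No", "Yes", "No")
)

toy_segments <-
  toy |>
  mutate(
    # case_when() checks the conditions top to bottom; the first TRUE wins
    segment = case_when(
      rating >= 3 & km < 10 ~ "High rating, near",
      rating >= 3           ~ "High rating, far",
      km < 10               ~ "Low rating, near",
      .default = "Low rating, far"
    )
  ) |>
  group_by(segment) |>
  summarise(
    n_employees = n(),                 # group size
    leave_rate  = mean(left == "Yes")  # share of the group that left
  )

toy_segments
```

Because the mean of a logical vector is the share of TRUE values, mean(left == "Yes") gives the proportion of each group that left.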
Exercise 3: What Is K-Means Clustering?
In Exercise 2, you created a simple rule-of-thumb segmentation by assigning employees to groups using fixed cutoffs.
In later exercises, you will explore a different approach called K-means clustering.
(a) Explain in your own words what K-means clustering is trying to achieve.
(b) Explain how K-means clustering differs from the rule-of-thumb segmentation you created in Exercise 2.
(c) What is one advantage of using a rule-of-thumb segmentation instead of K-means clustering?
(d) What is one advantage of using K-means clustering instead of a rule-of-thumb segmentation?
(e) In K-means clustering, which parts of the analysis are chosen by the analyst, and which parts are carried out automatically by the algorithm?
(f) Suppose an analytics team wants to use K-means clustering with variables such as job_satisfaction, distance_from_home, and monthly_income. Explain why it is important that these variables are numeric.
(g) Explain why it is usually important to scale variables before running K-means clustering.
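The effect of scaling can be seen in a small sketch. The data frame toy below is invented for illustration (the column names mirror the tutorial's variables, but the values are made up), and base R's scale() is used here only as a simple stand-in for a preprocessing workflow. K-means relies on Euclidean distance, so a variable measured in thousands can dominate one measured on a 1 to 4 scale.

```r
# Hypothetical toy data: satisfaction on a 1-4 scale, income in currency units
toy <- data.frame(
  job_satisfaction = c(1, 1, 1, 1, 4, 4, 4, 4),
  monthly_income   = c(2000, 3000, 4000, 5000, 2500, 3500, 4500, 5500)
)

set.seed(123)

# Unscaled: distances are driven almost entirely by monthly_income
km_raw <- kmeans(toy, centers = 2, nstart = 25)

# Scaled: each column now has mean 0 and standard deviation 1
toy_scaled <- scale(toy)
km_scaled  <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Compare how each clustering lines up with the satisfaction scores
table(satisfaction = toy$job_satisfaction, raw_cluster = km_raw$cluster)
table(satisfaction = toy$job_satisfaction, scaled_cluster = km_scaled$cluster)
```

In this toy case the unscaled run should split employees into lower and higher earners and largely ignore satisfaction, whereas the scaled run should separate the two satisfaction levels.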
In-Class Exercises
You will discuss these exercises in class with your peers in small groups and with your tutor. These exercises build on the ones you prepared above, so you will get the most value from the class if you have completed those before coming to class.
Exercise 4: Preparing the Data for K-Means Clustering
In the remaining exercises, we will use the following five numeric variables to build and compare K-means clustering models:
- job_satisfaction
- monthly_income
- years_at_company
- distance_from_home
- age
In this exercise, we prepare these five variables for the clustering analysis that follows.
(a) Create a dataset that contains the five variables listed above.
df_kmeans <-
  df |>
  select(
    job_satisfaction,
    monthly_income,
    years_at_company,
    YOUR_CODE,
    YOUR_CODE
  )
(b) Complete the code below to create a recipe for df_kmeans and add a step that normalises all numeric predictors.
rec <-
  YOUR_CODE(~ ., data = df_kmeans) |>
  YOUR_CODE(all_numeric_predictors())
(c) Explain what the code in part (b) is doing.
(d) Complete the code below to implement the preprocessing steps from the recipe in part (b) and apply them to the data.
df_scaled <-
  rec |>
  YOUR_CODE() |>
  YOUR_CODE(new_data = NULL)
(e) Explain what the two functions you used in part (d) are doing.
(f) Run the code below to verify that the preprocessing achieves the desired result. What do you notice about the means and standard deviations?
df_scaled |>
  summarise(
    across(
      everything(),
      list(mean = mean, sd = sd)
    )
  )
Exercise 5: Comparing Rule-of-Thumb Segmentation to K-Means
In Exercise 2, you created four employee groups using simple rules based on job_satisfaction and distance_from_home.
Now we will use K-means clustering on those same two variables and compare the result.
(a) Create a new dataset called df_scaled_small that keeps only the scaled versions of the following variables from df_scaled:
- job_satisfaction
- distance_from_home
df_scaled_small <-
  df_scaled |>
  select(YOUR_CODE, YOUR_CODE)
(b) Run K-means clustering on df_scaled_small using 4 clusters.
set.seed(123)
kmeans_small <-
  kmeans(
    YOUR_CODE,
    centers = YOUR_CODE,
    nstart = 10
  )
(c) Use the code below to visualise the clustering result.
fviz_cluster(
  kmeans_small,
  data = df_scaled_small,
  ggtheme = theme_minimal(),
  main = "2-variable K-means clustering"
)
(d) What do you notice about the shape of the clusters? Do they look very different from the rule-of-thumb groups in Exercise 2?
(e) The code below combines the original employee data with the K-means cluster assignments.
Using df_compare, calculate the number of employees and the attrition rate in each cluster.
df_compare <-
  df |>
  select(attrition, job_satisfaction, distance_from_home) |>
  mutate(cluster = as.factor(kmeans_small$cluster))
df_compare |>
  YOUR_CODE_HERE
(f) Compared with the rule-of-thumb segmentation, did K-means appear to add much in this two-variable setting? Explain briefly.
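One way to judge whether two segmentations carve up employees differently is to cross-tabulate them. The sketch below uses two made-up label vectors (hypothetical values, not output from the tutorial data) and base R's table(). If most of each row falls into a single cell, the two segmentations are telling a similar story.

```r
# Hypothetical labels for eight employees under two segmentations
rule_segment <- c("A", "A", "B", "B", "C", "C", "D", "D")
cluster      <- c(1, 1, 2, 2, 3, 4, 4, 4)

# Rows: rule-of-thumb groups; columns: K-means clusters
overlap <- table(rule_segment, cluster)
overlap

# Share of each rule-of-thumb group that lands in each cluster
round(prop.table(overlap, margin = 1), 2)
```

Cluster numbers are arbitrary labels, so only the pattern of overlap matters, not whether cluster 1 happens to match the first rule-of-thumb group.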
Exercise 6: Building a Richer K-Means Model
Your analytics team now wants to build a richer clustering model using all five scaled variables in df_scaled:
- job_satisfaction
- monthly_income
- years_at_company
- distance_from_home
- age
One challenge remains: we still need to choose a value for k, the number of clusters.
A colleague suggests using the elbow method as a starting point.
(a) Explain what the elbow method is trying to do.
(b) Run the code below to create an elbow plot for K-means clustering on df_scaled.
fviz_nbclust(
  df_scaled,
  kmeans,
  method = "wss",
  k.max = 10,
  nstart = 10
)
(c) Based on the elbow plot, what value of k would you choose? Briefly explain your answer.
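The quantity shown in an elbow plot can also be computed by hand, which makes it clearer what the plot is summarising. The sketch below simulates a small two-variable dataset (the data, the variable names x1 and x2, and the three-group structure are assumptions made purely for illustration) and records the total within-cluster sum of squares for k = 1 to 6.

```r
set.seed(123)

# Hypothetical data: three loose groups of 20 points in two dimensions
sim <- data.frame(
  x1 = c(rnorm(20, mean = 0), rnorm(20, mean = 5), rnorm(20, mean = 10)),
  x2 = c(rnorm(20, mean = 0), rnorm(20, mean = 5), rnorm(20, mean = 0))
)
sim_scaled <- scale(sim)

# Total within-cluster sum of squares for each candidate k
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(sim_scaled, centers = k, nstart = 10)$tot.withinss
})

# Base R elbow plot: look for the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k = 1 the value equals the total variation in the scaled data. Adding clusters generally lowers it, so the useful signal is the point where further reductions become small.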
(d) Run K-means clustering on df_scaled using your chosen value of k.
set.seed(123)
kmeans_full <-
  kmeans(
    YOUR_CODE,
    centers = YOUR_CODE,
    nstart = 10
  )
(e) Use the code below to visualise the clustering result.
fviz_cluster(
  kmeans_full,
  data = df_scaled,
  ggtheme = theme_minimal(),
  main = "5-variable K-means clustering"
)
(f) What can the plot in (e) tell you about the clustering result? What can it not tell you on its own?
(g) The code below combines the original employee data with the cluster assignments from your 5-variable K-means model.
Using df_compare_full, calculate:
- the number of employees in each cluster
- the attrition rate in each cluster
- the average value of each clustering variable in each cluster
df_compare_full <-
  df |>
  select(
    attrition,
    job_satisfaction,
    monthly_income,
    years_at_company,
    distance_from_home,
    age
  ) |>
  mutate(cluster = as.factor(kmeans_full$cluster))
df_compare_full |>
  YOUR_CODE_HERE
Exercise 7: Communicating Your Results in an Analytics Team Memo
You have now explored three ways to segment employees:
- a simple rule-of-thumb segmentation
- a 2-variable K-means model
- a richer 5-variable K-means model
Your task is to write a short memo for an analytics team meeting. The goal is to help the team decide which approach produced the most useful insight into employee attrition patterns.
- Use the “Write Like an Amazonian” tips to guide your writing.
- A strong memo should compare the approaches, not just describe the final one.
- Focus on the analytical takeaway: what did each method reveal, and what did the richer model add?
- Keep the writing clear and direct. Do not simply restate the code you ran.
Use the outline below and provide 2–3 bullet points under each subheading.
Problem Statement (1 short paragraph summarising why the team is segmenting employees and what business problem this analysis is trying to address)
Approaches Compared
- Briefly describe the three segmentation approaches you explored
- Note one strength or weakness of each approach
Recommended Approach
- Which approach would you recommend the team use as a starting point?
- Why?
Key Employee Segments or Patterns
- What were the most important patterns in attrition risk?
- Did different approaches highlight different kinds of employee groups?
Proposed Implications
- What might these results imply for how the firm thinks about retention risk?
- Do different employee segments suggest different kinds of follow-up or intervention?
Limitations & Next Steps
- What are the main limitations of the analysis?
- What would you want the team to do next?