library(tidyverse) # collection of packages for data manipulation and visualizationTutorial 4: Data Wrangling for Business Analytics
By the end of this tutorial, you should be able to:
- Recognize and choose appropriate
dplyrverbs for filtering rows, creating variables, selecting columns, ranking observations, and summarising groups.
- Create and verify key strategy variables by using
filter()andmutate()to constructprofit_marginandasset_turnover, then checking whether the results are sensible.
- Compare firms on different dimensions of performance by identifying those with the highest profit margins and those with the highest asset turnover.
- Analyse industry-level patterns by using
group_by()andsummarise()to calculate and interpret grouped summaries.
- Evaluate the limits of summary measures by explaining why averages may hide important differences within industries.
- Synthesize results for decision-makers by communicating the main findings in a short executive-style summary.
The Business Challenge
The Topic: Understanding firm strategy in the Australian economy
Firms can generate strong results in different ways. One firm may earn a lot of profit from each dollar of sales. Another may earn less on each sale, but generate very high sales relative to the assets it uses. These are different ways of operating, and they often reflect differences in business model and industry structure.
In this tutorial, we focus on two measures:
\[ \mathrm{Profit\ Margin} = \frac{\mathrm{EBIT}}{\mathrm{Sales}} \]
\[ \mathrm{Asset\ Turnover} = \frac{\mathrm{Sales}}{\mathrm{Assets}} \]
These measures help us think about two different dimensions of performance:
- Profit margin tells us how much profit a firm earns from each dollar of sales
- Asset turnover tells us how effectively a firm turns its asset base into sales
Using dplyr, we will create these variables, check that they were calculated correctly, and then use them to answer a series of business questions.
In particular, we will ask:
- Which firms appear strongest on profit margin, and which appear strongest on asset turnover?
- Do the same firms appear at the top of both rankings, or do different firms succeed in different ways?
- When we move from individual firms to industries, do the same patterns still appear?
- Do some industries seem to rely more on high margins, while others rely more on high turnover?
- What do these patterns suggest about how executives should interpret firm performance?
By the end of the tutorial, you should be able to use dplyr to turn raw accounting data into a short executive-style summary of how firms and industries generate strong results.
The Data: Financial Statement Observations from Yahoo Finance
To answer these questions, we draw on audited 2024 financial statements for the 200 largest ASX-listed firms (as measured by assets).
Financial statements provide a standardised snapshot of a company’s performance and financial health. In our data set, the key variables are:
- Company name (
conml) – the name of the firm as listed on the ASX
- Industry (
industry) – the industry classification, allowing comparison across sectors such as mining, banking, or healthcare
- Profit (
ebit) – earnings before interest and taxes, representing core business profit after expenses (from the income statement)
- Assets (
at) – the total resources the company owns that have economic value (from the balance sheet)
- Invested Capital (
invested_capital) – the total capital invested in the company by both shareholders and debt holders to fund operations and growth. It generally includes:
By working with these statements and variables, we can replicate the kind of assessments that professional analysts and fund managers use to guide billion-dollar investment decisions.
As you will see, just a few lines of R code can wrangle raw financial statements information into orderly data summaries that allow us to understand how businesses across different industry sectors perform across various metrics.
R packages for today
Prepare these Exercises before Class
Prepare these exercises before coming to class. Plan to spend 45 minutes on these exercises.
eval flag when you want to evaluate code!
In the R code chunks below we have provided starter code for you to work from. We have set the key eval to the value false so that they are not run because they have syntax such as YOUR_VALUE_HERE which would generate errors.
Switch the eval value to true when you want the R code within a chunk to be run when you compile your document.
Exercise 1: Meet the Main dplyr Verbs
In this tutorial, we will use a small set of dplyr verbs to build variables, inspect data, rank firms, and compare industries.
Before we start coding, it helps to know what each verb is designed to do.
(a). Match each dplyr verb below to its plain-English meaning.
| Verb | Meaning |
|---|---|
filter() |
A. Create a new variable |
mutate() |
B. Sort the rows |
select() |
C. Keep only the rows that match a condition |
arrange() |
D. Keep only the columns you want |
slice_max() |
E. Keep the top rows based on a variable |
group_by() |
F. Split the data into groups |
summarise() |
G. Collapse many rows into summary measures |
(b). For each business task below, write the dplyr verb that best fits the task.
- Keep only firms with positive sales and positive assets
- Create a new variable called
profit_margin - Show only company name, sales, and profit margin
- Sort firms from highest to lowest profit margin
- Keep the five firms with the highest asset turnover
- Set up the data so that each industry is analysed separately
- Calculate the average ROIC for each industry
Exercise 2: Understanding the Pipe Operator, |>
In this exercise, we use two animals — a lion and a giraffe — to explore how the pipe operator, |>, works. This lets us focus on the order of the steps without needing to think about data yet.
Suppose we start with an object called lion. We want the lion to do the following actions in order:
- stand up
- lift its head by 30 degrees
- roar at volume 10
(a). One way to write this is to create a new object after each step.
lion_step1 <- stand_up(lion)
lion_step2 <- lift_head(lion_step1, angle = 30)
lion_step3 <- roar(lion_step2, volume = 10)What does lion_step2 represent?
(b). We could also write the same sequence by placing one function inside another.
roar(lift_head(stand_up(lion), angle = 30), volume = 10)This version is shorter, but it can be harder to read because we have to work from the inside out.
Which function runs first in this version?
(c). The pipe operator lets us write the same sequence in the order the actions happen.
lion |>
stand_up() |>
lift_head(angle = 30) |>
roar(volume = 10)In the line lift_head(angle = 30), what is being piped in, and what is the extra argument?
(d). Now try this yourself.
Suppose we start with an object called giraffe. We want the giraffe to do the following actions in order:
- step forward by 2 steps
- lower its neck by 45 degrees
- drink water for 5 seconds
Write this sequence in the three ways shown above.
Step-by-step with intermediate objects
# Write your answer hereFunction inside function
# Write your answer hereWith the pipe
# Write your answer here(e). Complete the sentence below:
The pipe operator takes __________ and passes it into __________.
(f). In one sentence, explain how the pipe helps us write code in an “and then” way.
Exercise 3: Create the Analysis Dataset
In the rest of this tutorial, we will work with a dataset called firm_strategy. To create firm_strategy, we need to construct some variables from the ASX dataset.
(a). Load the ASX data located in data/asx_200_2024.csv into an object called asx_data.
asx_data <- YOUR_CODE_HERE(b). We now want to create a new dataset called firm_strategy from asx_data.
To do this, we need to:
keep only firms for which the ratios can be calculated sensibly, and
create two new variables:
profit_marginasset_turnover
Before writing any code, look at the formulas for these variables and think about which original variables must be strictly positive.
(c). Use your answer to complete the code below. Replace YOUR_CONDITIONS and fill in the two missing ratio definitions.
firm_strategy <-
asx_data |>
filter(YOUR_CONDITIONS) |>
YOUR_CODE_HERE(
profit_margin = YOUR_CODE_HERE,
asset_turnover = YOUR_CODE_HERE
)(d). In one sentence, explain why we use filter() before mutate() in our solution to (c).
Exercise 4: Create a Smaller Checking Dataset
We have now created a new dataset called firm_strategy. Before moving on, it is good practice to check that the new variables were created correctly. One useful way to do this is to create a smaller dataset that contains only the variables we need for checking.
(a). Create a new object called firm_strategy_check that keeps only the following variables:
conmlebitatsaleprofit_marginasset_turnover
firm_strategy_check <-
firm_strategy |>
YOUR_FUNCTION(YOUR_VARIABLES)(b). Display the first 10 rows of firm_strategy_check. You may find the command slice_head(n = 10) helpful here.
firm_strategy_check |>
YOUR_CODE(c). Why might it be useful to create a smaller checking dataset instead of inspecting the full firm_strategy dataset?
(d). Looking at a few rows is a useful first step, but it might not scale well to a large dataset where we compute hundreds of variables. What might be a more scalable next step if you wanted to check whether profit_margin and asset_turnover look reasonable across all firms?
In-Class Exercises
You will discuss these exercises in class with your peers in small groups and with your tutor. These exercises build from the exercises you have prepared above, you will get the most value from the class if you have completed those above before coming to class.
Exercise 5: Which Firms Rely Most on Margin? Which Rely Most on Volume?
Some firms generate strong profits from each dollar of sales. Others generate large sales from a given asset base. In this exercise, we compare firms at the top of each of these two dimensions.
(a). Use slice_max() to identify the five firms with the highest profit_margin.
Your table should include:
conmlindustryebitatsaleprofit_marginasset_turnover
firm_strategy |>
slice_max(YOUR_CODE_HERE) |>
select(YOUR_VARIABLES_HERE)(b). Now produce the equivalent table for the five firms with the highest asset_turnover.
firm_strategy |>
slice_max(YOUR_CODE_HERE) |>
select(YOUR_VARIABLES_HERE)(c). Compare the two tables.
- Do the same firms appear in both tables?
- What kinds of industries seem more common in each table?
- What does this suggest about different ways firms generate profits?
Exercise 6: Comparing Industries Using Average Values
In Exercise 5, you compared individual firms at the top of the profit margin and asset turnover distributions. We now take a step back and look at broader patterns across industries. To do this, we will group the data by industry and calculate average values for each group.
(a). Complete the code below to create a summary table that shows, for each industry:
- the number of firms
- average
profit_margin - average
asset_turnover
industry_summary <-
firm_strategy |>
group_by(YOUR_VARIABLE) |>
summarise(
n_firms = n(),
avg_profit_margin = mean(YOUR_VARIABLE, na.rm = TRUE),
avg_asset_turnover = YOUR_CODE
)(b). Now focus only on industries with at least 3 firms, and sort industry_summary from highest to lowest average profit_margin.
industry_summary |>
filter(YOUR_CONDITION) |>
arrange(YOUR_CODE)(c). Which industries appear near the top of the table when sorted by average profit_margin? What does this suggest about how firms in these industries tend to make money?
(d). Again focusing only on industries with at least 3 firms, sort industry_summary from highest to lowest average asset_turnover.
industry_summary |>
YOUR_CODE |>
YOUR_CODE(e). Compare the two rankings. How do these industry-level patterns comapre to what you saw earlier in Exercise 5, where different kinds of firms appeared in the high-margin and high-turnover tables?
(f). In this exercise, we focused on industries with at least 3 firms. Why might that be useful? And even after doing this, why might an average still give a slightly misleading picture of an industry?
(g). In this exercise, we focused on industry averages. Another summary measure analysts often use is the standard deviation, which tells us how spread out the firms in an industry are. What do you think it would mean if an industry had:
- a high standard deviation in
profit_margin? - a low standard deviation in
profit_margin?
Exercise 7: Summarising the Findings for Executives
Imagine you work at a large consulting company that is preparing an annual report on the financial results of Australian firms.
Your team has been asked to analyse how large Australian listed firms generate strong results across different industries. As part of this project, you need to present an interim result to the executives overseeing the report.
Write 3–4 sentences that could appear in the interim report’s executive summary.
Your summary should answer the following:
- What did you analyse?
- What were the main patterns you found?
- What is the main business lessons one should take from these results?
Write for a non-technical audience and focus on the business meaning of the results.
Exercise 8 [Optional]: Visualising the Profit Margin – Asset Turnover Relationship
In Exercises 5 and 6, you identified that firms and industries tend to perform strongly on either profit margin or asset turnover, but rarely both. In this exercise, you will visualise that pattern using a scatter plot.
(a). Using industry_summary, create a scatter plot with avg_profit_margin on the x-axis and avg_asset_turnover on the y-axis.
industry_summary |>
ggplot(aes(x = YOUR_VARIABLE,
y = YOUR_VARIABLE)
) +
YOUR_CODE() (b). Does the plot look the way you expected? Is anything making it hard to read?
(c). Filter industry_summary to keep only industries with at least 3 firms and an average profit margin above -5, then recreate the scatter plot from (a).
industry_summary |>
filter(YOUR_CONDITIONS) |>
ggplot(aes(x = YOUR_VARIABLE,
y = YOUR_VARIABLE)
) +
YOUR_CODE() (d). Add a geom_smooth() layer to the plot from (b).
industry_summary |>
filter(YOUR_CONDITIONS) |>
ggplot(aes(x = YOUR_VARIABLE,
y = YOUR_VARIABLE)
) +
YOUR_CODE() +
YOUR_CODE(method = "lm",
se = FALSE)(e). What does the line suggest about the relationship between average profit margin and average asset turnover across industries?