Tutorial 5: Shaping and Joining Data for Business Analytics

Learning Goals

By the end of this tutorial, you should be able to:

Recognize the difference between tidy and messy data, and use pivot_longer() to tidy up wide data.
Summarize and visualise financial data, and explain why tidy formats make analysis easier.
Plot and interpret relationships between share price and earnings per share (EPS).
Combine data from multiple tables using joins (left_join(), inner_join(), anti_join()), and understand what each join includes or excludes.

The Business Challenge

The Topic: How well do Australian firms’ earnings explain what investors are willing to pay for these companies?

Share prices of Australia’s biggest companies go up and down every day. These changes reflect how investors feel about the company’s future earnings — based on news, forecasts, and the broader economy.

Two key numbers help us understand this:

Earnings per share (EPS), tells us how much profit a company earns for each share:

\[ \text{EPS} = \frac{\text{Profit}}{\text{Shares Outstanding}} \]

The Price-to-Earnings (P/E) Ratio shows how much investors are willing to pay for $1 of earnings:

\[ \text{P/E} = \frac{\text{Share Price}}{\text{EPS}} \]

To tie these measures back to our earlier material on firm’s Earnings Before Interest and Taxes (EBIT), we can define firm’s earnings that are ultimately paid out to shareholders as follows:

\[ \text{Earnings} = \text{EBIT} - \text{Interest} - \text{Taxes} \]

We’ll use real data from major Australian companies listed on the ASX to practise cleaning, combining, and analysing financial variables like EPS and the P/E ratio. In particular we will:

Tidy messy financial data
Join multiple datasets together

The Method: Shaping and Combining Data with `tidyr` and `dplyr`

Data often needs cleaning before it’s ready to analyse. It might be too wide (with different years in separate columns), too long, or split across multiple files.

In this tutorial, we’ll use two tidyverse tools to fix that:

tidyr: to reshape messy data using functions like pivot_longer()
dplyr: to join different datasets together and spot what’s missing

These tools will help us:

Make messy data easier to work with
Combine company data with industry labels

Getting set up and loading the data

R packages for today

library(tidyverse)     # collection of packages for data manipulation and visualization

Prepare these Exercises before Class

Prepare these exercises before coming to class. Plan to spend 45 minutes on these exercises.

Exercise 1: Identifying Issues with Un-tidy Data

We’ll start with a deliberately messy version of the ASX stock price data. Run the following code to load and implement some preliminary manipulations to this dataset.

asx_prices_messy <- 
    read_csv("data/asx_prices_messy.csv")

asx_prices_messy |> 
    select(gvkey, conm, price_2023, price_2024, 
           eps_2023, eps_2024) |>
    head(10)

# A tibble: 10 × 6
   gvkey  conm                      price_2023 price_2024 eps_2023 eps_2024
   <chr>  <chr>                          <dbl>      <dbl>    <dbl>    <dbl>
 1 327267 1414 DEGREES LTD               0.04       0.067  -0.0091  -0.0107
 2 349318 29METALS LIMITED               0.375     NA      -0.799   NA     
 3 284698 3D ENERGI LIMITED              0.056      0.07    0.0129  -0.0075
 4 317959 3P LEARNING LTD                1.27       1.02    0.023   -0.208 
 5 339106 4DMEDICAL LTD                  0.45       0.675  -0.105   -0.0971
 6 314650 4DS MEMORY LTD                 0.135      0.084  -0.0038  -0.0031
 7 324960 5E ADVANCED MATERIALS INC      2.26       0.54   -0.7     -1.18  
 8 325746 5G NETWORKS LIMITED            0.125      0.148  -0.0579  -0.0836
 9 253429 88 ENERGY LTD                  0.005     NA      -0.0006  NA     
10 323980 8COMMON LTD                    0.075      0.041  -0.0147  -0.0114

(a). In what ways is this data frame not tidy? Be specific: which tidy‑data principles are violated?

(b). Suppose we want the average share price across all firms for the period 2023–2024. With this messy layout you will need to compute sum and counts across columns (i.e., sum prices for 2023, for 2024; then, count non-missing observations in these columns for 2023, 2024, etc). Use the starter code below to compute the average share price across all firms for the period 2023-2024.

asx_prices_messy |>
    summarise(
        # Sum the prices from the 2023 year
        sum_prices_2023   = sum(YOUR_CODE_HERE, na.rm = TRUE),
        # Sum the prices from the 2024 year
        sum_prices_2024   = sum(YOUR_CODE_HERE, na.rm = TRUE),
        # Count the number of firms with non-missing
        # prices in 2023
        count_firms_2023 = sum(!is.na(price_2023)),
        # Count the number of firms with non-missing
        # prices in 2024
        count_firms_2024 = sum(!is.na(price_2024))
    ) |>
    mutate(
        # Compute the total sum of prices across both years
        total_sum   = YOUR_CODE_HERE + YOUR_CODE_HERE,
        # Compute the total count of firms across both years
        total_count = count_firms_2023 + count_firms_2024
    ) |>
    mutate(
        # Manually compute the average
        avg_price_23_24 = YOUR_CODE_HERE / YOUR_CODE_HERE
    ) |> 
    select(avg_price_23_24)

(c) Explain why the approach we took in (b) doesn’t scale well as more years of data are added.

(d) Examining relationships between variables in a messy data frame can be especially tricky. Suppose we want to see whether share price and earnings per share are correlated in 2024. With this messy structure we must hard-code the year’s columns:

asx_prices_messy |>
    filter(eps_2024 > -5) |>
    ggplot(aes(x = eps_2024, y = price_2024)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", se = TRUE, color = "blue") +
    labs(
        title = "Share Price vs EPS (2024)",
        x = "Earnings per Share (EPS, 2024)",
        y = "Share Price (2024)"
    ) +
    theme_minimal()

Based on this plot, what relationship do you see in 2024? Why does this make sense economically (what are investors “buying” when they invest)?

(e). Now imagine you also wanted to include 2023 in the same analysis. You do not need to implement code to add these data points to the plot above — just explain the steps you’d need to take.

Which additional columns would you need to reference in the data?
How would your plotting code change?
What problems do you foresee if you wanted to extend this further to 2019–2024?

Exercise 2: Combining Datasets

In this exercise, we’ll combine two or more data frames using joins. Sometimes, joining adds columns. Other times, it can add rows when duplicate keys exist.

Let’s walk through both situations.

To start, let’s load the necessary data for this exercise - for two of these datasets, the code below also does some ‘light’ wrangling (similar to what we did to our datasets in last week’s tutorial):

firm_codes <- 
    read_csv("data/asx_200_2024.csv") |> 
    select(gvkey, conm, gsubind)

subindustry_names <-
    read_csv("data/GICS_subindustry.csv")

pe_data <- 
    read_csv("data/pe.csv") |> 
    select(gvkey, fyear, pe) |>
    filter(!is.na(pe)) |> 
    arrange(gvkey, fyear)

(a). We start with a data frame that lists each firm’s subindustry code (gsubind), using the largest 200 firms on the ASX in 2024:

firm_codes |> 
    head(10)

# A tibble: 10 × 3
   gvkey  conm                    gsubind
   <chr>  <chr>                     <dbl>
 1 013312 BHP GROUP LTD          15104020
 2 210216 TELSTRA GROUP LIMITED  50101020
 3 223003 CSL LTD                35201010
 4 212650 TRANSURBAN GROUP       20305020
 5 100894 WOOLWORTHS GROUP LTD   30101030
 6 212427 FORTESCUE LTD          15104050
 7 101601 WESFARMERS LTD         25503030
 8 226744 RAMSAY HEALTH CARE LTD 35102020
 9 220244 QANTAS AIRWAYS LTD     20302010
10 017525 ORIGIN ENERGY LTD      55101010

We also have a lookup table that tells us the name of each subindustry:

subindustry_names |> 
    head(10)

# A tibble: 10 × 2
    gsubind subind                              
      <dbl> <chr>                               
 1 10101010 Oil & Gas Drilling                  
 2 10101020 Oil & Gas Equipment & Services      
 3 10102010 Integrated Oil & Gas                
 4 10102020 Oil & Gas Exploration & Production  
 5 10102030 Oil & Gas Refining & Marketing      
 6 10102040 Oil & Gas Storage & Transportation  
 7 10102050 Coal & Consumable Fuels             
 8 15101010 Commodity Chemicals                 
 9 15101020 Diversified Chemicals               
10 15101030 Fertilizers & Agricultural Chemicals

Use a left join to add the subindustry name to each firm in the firm_codes dataset using the variable gsubind as the join key. The code below gets you started:

firm_info <- 
    YOUR_DATASET_NAME |>
    left_join(YOUR_DATASET_NAME, by = "YOUR_VARIABLE_NAME")

firm_info |> 
    head(10)

(b). What kind of join did we just perform? To help answer this question, consider the following:

What kind of thing are we adding: columns or rows?
Did each gsubind only appear once in the dataset subindustry_names?
Does this kind of join have a name? If so, what is it?

(c).

Now let’s bring in PE ratio data, which has multiple years per firm:

pe_data |> 
    head(10)

# A tibble: 10 × 3
   gvkey  fyear    pe
   <chr>  <dbl> <dbl>
 1 005302  2008  41.7
 2 010991  2004  27.2
 3 010991  2005  41.8
 4 010991  2006  45.0
 5 010991  2007  29.7
 6 010991  2008 135. 
 7 010991  2021 979. 
 8 010991  2023  57.0
 9 013312  2005  21.3
10 013312  2006  14.8

Use a left join to add the P/E data to each firm in the firm_info dataset using the variable gvkey as the join key. Use the starter code below to implement this join:

firm_pe_data <- 
    YOUR_DATASET_NAME |>
    YOUR_JOIN_TYPE(YOUR_DATASET_NAME, by = "YOUR_VARIABLE_NAME")

firm_pe_data |>
  head(10) |>
  select(pe, everything())

(d). What kind of join did we just perform? To help answer this question, consider the following:

What kind of thing are we adding: columns or rows?
How many rows are in the dataset firm_pe_data? How about firm_info?
Does each gvkey in the original table (firm_info) now appear once or multiple times in the ‘joined’ data frame (firm_pe_data)? Why?
Does this kind of join have a name? If so, what is it?

In-Class Exercises

You will discuss these exercises in class with your peers in small groups and with your tutor. These exercises build from the exercises you have prepared above, you will get the most value from the class if you have completed those above before coming to class.

Exercise 3: Tidying and Unlocking Scalable Insights

(a). We’ll now tidy the messy stock price dataset from Exercise 1 so we can more easily analyse and visualise it. Use the starter code below to turn your wide-format data back into a tidy format using pivot_longer().

asx_prices_tidy <- 
    asx_prices_messy |>
    pivot_longer(
        cols = YOUR_CODE_HERE,
        names_to = c(".value", "fyear"),
        names_sep = YOUR_CODE_HERE,
        values_drop_na = TRUE
    ) |>
    mutate(fyear = as.numeric(fyear))

asx_prices_tidy |> 
    head(10)

(b). Why is this structure better for analysis? Why might calculations or visualisations now be easier?

(c). Let’s re-do a simple tasks from Exercise 1, that was quite complicated to implement — and see how much easier it is now that our data is tidy. What’s the average share price for all firms across 2023 and 2024?

asx_prices_tidy |>
    YOUR_CODE(fyear %in% c(YOUR_YEARS)) |>
    YOUR_CODE(avg_price = mean(YOUR_VARIABLE, na.rm = TRUE))

(d). Let’s re-do our plot from Exercise 1 too. Plot price vs EPS over the full sample (2019–2024).

(e). Suppose we wanted to modify the plot in (d) to only include the years 2023 and 2024. Modify the code you just wrote to make this change by adding one new line of code.

(f). What kind of relationship do you observe in the plot in (d)? What might explain firms that lie above or below the fitted line? For EPS < 0, does our line-of-best-fit ‘make sense’?

Exercise 4: What’s Missing? Understanding Joins by Seeing What They Drop

In Exercise 2, we used left_join() to combine data from different tables — like firm details, subindustry names, and P/E ratios. That helped us build one data frame with all the information in one place. Now, we’ll go a step further and explore what different kinds of joins keep and drop. This can help us spot missing or underrepresented data — and understand our dataset more deeply.

(a). Let’s start by counting how many firms in the ASX200 belong to each subindustry. Use the starter code below to return the three most common industries:

firm_info |> 
    YOUR_CODE(subind) |> 
    YOUR_CODE(desc(n)) |> 
    YOUR_CODE(3)

And, now the three least common:

firm_info |> 
    YOUR_CODE(subind) |> 
    YOUR_CODE((n)) |> 
    YOUR_CODE(3)

(b). Are there any subindustries that don’t appear in the ASX200 sample-and so are even less common those that we identify in (a)?

We can use anti_join() to find subindustries listed in subindustry_names but missing from our firms data. How many subindustries are not included in the ASX200?

YOUR_CODE |> 
    YOUR_CODE(firm_info, by = "subind") |> 
    nrow()

(c). What kinds of industries are missing? Why might they be underrepresented in the ASX200?

(d). An inner_join() keeps only the firms that have a matching subindustry in the lookup table. Recreate the join you did in Exercise 2(a), but this time use inner_join() instead of left_join().

firm_info_inner <- 
    firm_codes |>
    YOUR_CODE(subindustry_names, by = "gsubind")

(e) Are there any differences in the data frame in (d) to the one you constructed in 2 (a)? Explain your answer.

(f). Over this exercise and Exercise 2, you’ve used three different types of joins:

left_join()
anti_join()
inner_join()

Each one gives you a slightly different view of how your data frames relate to each other.

In your own words, answer the following:

What does each join keep and what does it drop?
Which join is most useful when you want to preserve all rows from one table, even if they don’t match?
Which join is best for finding mismatches between two tables?
Which join only keeps perfect matches between the two tables?