Tutorial 5: Shaping and Joining Data for Business Analytics

Learning Goals

By the end of this tutorial, you should be able to:

  • Explain the difference between tidy and messy data, and use pivot_longer() to tidy up wide data.
  • Summarize and visualise financial data, and explain why tidy formats make analysis easier.
  • Plot and interpret relationships between share price and earnings per share (EPS).
  • Combine data from multiple tables using joins (left_join(), inner_join(), anti_join()), and understand what each join includes or excludes.

The Business Challenge

Investors do not just care about whether a company is profitable. They also care about how much they are paying for those earnings.

In this tutorial, we will work with data on large Australian firms to explore a simple question:

How strongly are share prices and valuation ratios related to company earnings, and how do those patterns vary across firms and industries?

To answer that question, we first need to prepare the data. Some of the financial variables are stored in a messy wide format, and other information is spread across separate files. So before we can analyse anything, we need to:

  • reshape the data into tidy form
  • join together firm, industry, and valuation data
  • check what information is missing after those joins

Two key variables in today’s tutorial are:

\[ \text{EPS} = \frac{\text{Profit}}{\text{Shares Outstanding}} \]

EPS tells us how much profit a company earns for each share.

\[ \text{P/E} = \frac{\text{Share Price}}{\text{EPS}} \]

The P/E ratio tells us how much investors are willing to pay for each dollar of earnings.

By the end of the tutorial, you will have built and explored a dataset that lets you compare firms and industries using share prices, earnings, and valuation ratios.

R packages for today

library(tidyverse)     # collection of packages for data manipulation and visualization

Prepare these Exercises before Class

Prepare these exercises before coming to class. Plan to spend 30 minutes on these exercises.

Exercise 1: Identifying Issues with Untidy Data

We will start with a deliberately messy version of the ASX stock price data. Run the following code to load the dataset.

asx_prices_messy <- read_csv("data/asx_prices_messy.csv")

Then inspect a subset of the data:

asx_prices_messy |>
    select(gvkey, conm, price_2023, price_2024,
           eps_2023, eps_2024) |>
    head(10)
# A tibble: 10 × 6
   gvkey  conm                      price_2023 price_2024 eps_2023 eps_2024
   <chr>  <chr>                          <dbl>      <dbl>    <dbl>    <dbl>
 1 272690 1300 SMILES LTD               NA         NA      NA       NA     
 2 327267 1414 DEGREES LTD               0.04       0.067  -0.0091  -0.0107
 3 349318 29METALS LIMITED               0.375     NA      -0.799   NA     
 4 284698 3D ENERGI LIMITED              0.056      0.07    0.0129  -0.0075
 5 346142 3D METALFORGE LIMITED         NA         NA      NA       NA     
 6 317959 3P LEARNING LTD                1.27       1.02    0.023   -0.208 
 7 253413 3Q HOLDINGS LTD               NA         NA      NA       NA     
 8 339106 4DMEDICAL LTD                  0.45       0.675  -0.105   -0.0971
 9 314650 4DS MEMORY LTD                 0.135      0.084  -0.0038  -0.0031
10 324960 5E ADVANCED MATERIALS INC      2.26       0.54   -0.7     -1.18  

(a) In what ways is this data frame not tidy? What principles of tidy data are violated?

(b) Use the starter code below to compute the average share price across all firms for 2023 and 2024.

asx_prices_messy |>
    summarise(
        # Add up all 2023 share prices
        sum_prices_2023 = sum(YOUR_CODE_HERE, na.rm = TRUE),
        # Add up all 2024 share prices
        sum_prices_2024 = sum(YOUR_CODE_HERE, na.rm = TRUE),
        # Count how many firms have a non-missing share price in 2023
        count_firms_2023 = sum(!is.na(price_2023)),
        # Count how many firms have a non-missing share price in 2024
        count_firms_2024 = sum(!is.na(price_2024))
    ) |>
    mutate(
        # Combine the 2023 and 2024 price totals
        total_sum = YOUR_CODE_HERE + YOUR_CODE_HERE,
        # Combine the number of non-missing firm prices across both years
        total_count = count_firms_2023 + count_firms_2024
    ) |>
    mutate(
        # Divide the total sum of prices by the total number of firms
        avg_price_23_24 = YOUR_CODE_HERE / YOUR_CODE_HERE
    ) |> 
    select(avg_price_23_24)

(c) Why does the approach in part (b) not scale well as more years are added?

(d) Complete the code below to plot share price against earnings per share in 2024.

asx_prices_messy |>
    filter(eps_2024 > -5) |>
    ggplot(aes(x = eps_2024, y = price_2024)) +
    YOUR_CODE_HERE +
    YOUR_CODE_HERE +
    labs(
        title = "YOUR_LABEL_HERE",
        x = "YOUR_LABEL_HERE",
        y = "YOUR_LABEL_HERE"
    ) +
    theme_minimal()

What relationship do you see? Why does this make sense economically?

(e) Suppose you also wanted to include 2023 in this analysis. Briefly explain:

  • which extra columns you would need
  • what you would need to change in the plotting process
  • why this approach would become awkward if extended to 2019 to 2024

(f) In two or three sentences, explain to a manager why this data structure makes analysis harder than it needs to be.

Exercise 2: Combining Datasets

In this exercise, you will combine information from two datasets using a join.

To start, load the data:

firm_codes <-
    read_csv("data/asx_200_2024.csv") |>
    select(gvkey, conm, gsubind)

subindustry_names <-
    read_csv("data/GICS_subindustry.csv")

Take a quick look at the two datasets:

firm_codes |>
    head(10)
# A tibble: 10 × 3
   gvkey  conm                    gsubind
   <chr>  <chr>                     <dbl>
 1 013312 BHP GROUP LTD          15104020
 2 210216 TELSTRA GROUP LIMITED  50101020
 3 223003 CSL LTD                35201010
 4 212650 TRANSURBAN GROUP       20305020
 5 100894 WOOLWORTHS GROUP LTD   30101030
 6 212427 FORTESCUE LTD          15104050
 7 101601 WESFARMERS LTD         25503030
 8 226744 RAMSAY HEALTH CARE LTD 35102020
 9 220244 QANTAS AIRWAYS LTD     20302010
10 017525 ORIGIN ENERGY LTD      55101010
subindustry_names |>
    head(10)
# A tibble: 10 × 2
    gsubind subind                              
      <dbl> <chr>                               
 1 10101010 Oil & Gas Drilling                  
 2 10101020 Oil & Gas Equipment & Services      
 3 10102010 Integrated Oil & Gas                
 4 10102020 Oil & Gas Exploration & Production  
 5 10102030 Oil & Gas Refining & Marketing      
 6 10102040 Oil & Gas Storage & Transportation  
 7 10102050 Coal & Consumable Fuels             
 8 15101010 Commodity Chemicals                 
 9 15101020 Diversified Chemicals               
10 15101030 Fertilizers & Agricultural Chemicals

(a) Look at the variables in firm_codes and subindustry_names. Which variable or variables look like they could be used to join these two datasets?

(b) Which join variable would you choose? Explain why it is the best choice here.

(c) Use a left join to add the subindustry name to each firm in firm_codes.

firm_info <-
    YOUR_DATASET_NAME |>
    left_join(YOUR_DATASET_NAME, by = join_by(YOUR_VARIABLE_NAME))

firm_info |>
    head(10)

(d) What changed after the join? Briefly explain:

  • what information was added
  • whether the join mainly added rows or columns
  • why this makes the dataset more useful for analysis

(e) Look at how gsubind behaves in the two datasets.

  • In which dataset would you expect each gsubind to appear only once?
  • In which dataset can the same gsubind appear many times?
  • Why does that make sense?

(f) Fill in the variable name in each sentence below.

In subindustry_names, _____ behaves like a primary key because each code identifies one subindustry label.

In firm_codes, _____ behaves like a foreign key because many firms can share the same subindustry code.

(g) Suppose we reversed the join and started with subindustry_names on the left-hand side. What would be different about the resulting dataset, and why?

In-Class Exercises

You will discuss these exercises in class with your peers in small groups and with your tutor. These exercises build from the exercises you have prepared above, you will get the most value from the class if you have completed those above before coming to class.

Exercise 3: Tidying Data to Make Analysis Easier

(a) Look at the column names in asx_prices_messy, and answer the following:

  • What two pieces of information are stored in these column names?
  • What character separates those two pieces of information?

(b) We now want to reshape the stock price data so that each row corresponds to one firm in one year. Complete the starter code below to turn the data into a tidy format.

asx_prices_tidy <-
    asx_prices_messy |>
    pivot_YOUR_CODE(
        cols = YOUR_CODE_HERE,
        names_to = c(".value", "fyear"),
        names_sep = YOUR_CODE_HERE,
        values_drop_na = TRUE
    ) |>
    mutate(fyear = as.numeric(fyear))

asx_prices_tidy |>
    head(10)

(c) In Exercise 1, calculating the average share price across 2023 and 2024 took several steps. Use the tidy dataset to calculate the average share price across those two years.

asx_prices_tidy |>
    YOUR_CODE(fyear %in% c(YOUR_YEARS)) |>
    YOUR_CODE(avg_price = mean(YOUR_VARIABLE, na.rm = TRUE))

(d) Why is the calculation in part (c) easier to write using the tidy dataset than it was in Exercise 1?

What would we need to do if we added many more years of data?

(e) We now want to recreate our plot of share price against EPS using the tidy dataset.

Complete the code below.

asx_prices_tidy |>
    filter(eps > -5) |>
    # create a scatterplot with EPS on the x-axis and price on the y-axis
    YOUR_CODE() +
    # add points
    YOUR_CODE +
    # add a fitted straight line
    YOUR_CODE +
    # add a title and axis labels
    YOUR_CODE +
    # use a theme
    YOUR_CODE

(f) Modify your code from part (e) so that the plot only shows observations from 2023 and 2024. Make this change by adding just one line to your existing pipeline.

(g) Look at your plot from part (e) and answer the following:

  • What overall relationship do you observe between EPS and share price?
  • What might explain firms that lie well above or below the fitted line?
  • Do firms with very low or negative EPS seem to follow the same straight-line pattern as the rest of the data?

Exercise 4: Adding P/E Ratios with a Join

(a) We want to add P/E ratios to our tidy ASX firm-year dataset.

Run the code below to load and prepare the P/E ratio data.

pe_data <-
    read_csv("data/pe.csv") |>
    select(gvkey, fyear, pe) |>
    filter(!is.na(pe)) |>
    arrange(gvkey, fyear)

What variables in pe_data seem to identify each row?

(b) In asx_tidy, what does one row represent?

Which variable or variables together identify a unique observation in this dataset?

(c) Which two variables should we use to join asx_tidy and pe_data?

(d) Why do we need both of these variables in the join?

(e) Complete the code below to add the pe variable from pe_data to asx_tidy.

asx_with_pe <-
    asx_tidy |>
    left_join(
        pe_data,
        by = YOUR_CODE_HERE
    )

(f) After the join, what changed in the dataset?

In your answer, explain:

  • what new information was added
  • whether the join mainly added rows or columns

(g) Now suppose we join the two datasets using only gvkey.

Run the code below.

asx_wrong_join <-
    asx_tidy |>
    left_join(
        pe_data,
        by = YOUR_CODE_HERE
    )

What goes wrong here, and why?

Exercise 5: Comparing Industries with Grouped Summaries

(a) We now want to compare valuation patterns across industries.

To do that, we first need to add industry codes to asx_with_pe.

Complete the code below to join firm_info, which you created in Exercise 2, onto asx_with_pe using gvkey.

asx_analysis <-
    asx_with_pe |>
    YOUR_CODE(
        YOUR_CODE,
        by = YOUR_CODE_HERE
    )

What new information does this join add?

(b) Using asx_analysis, create a summary dataset called industry_year_summary that does all of the following:

  • groups the data by subindustry and year
  • calculates the average P/E ratio
  • calculates the average share price
  • calculates the average EPS
  • calculates the number of firms in each subindustry-year
industry_year_summary <-
    asx_analysis |>
    YOUR_CODE()

(c) Why do we need to group by both subindustry and year here, rather than just one of them?

(d) Using industry_year_summary, create a scatterplot that does all of the following:

  • puts average P/E ratio on the x-axis
  • puts average share price on the y-axis
  • shows one point for each subindustry-year observation
  • includes a summary line describing the pattern in the data
  • includes a clear title and axis labels
  • uses a theme
  • only includes subindustries that have at least 3 firms
  • only includes subindustries with average P/E ratios less than 100
industry_year_summary |> 
    YOUR_CODE()

(e) What do you notice in the plot?

In a 2-3 sentences, comment on:

  • whether industry-year observations with higher average P/E ratios also tend to have higher average share prices
  • whether there are any clear clusters or outliers
  • what kinds of industry differences might help explain the pattern

Exercise 6: Auditing Missing P/E Ratios Before an Analytics Meeting

Suppose your team is preparing for an analytics meeting. You have built a firm-year dataset with share prices, earnings, and P/E ratios, but many firm-year observations still have missing pe values.

Before moving ahead, you want to understand:

  • what the full dataset looks like
  • what the complete-match dataset looks like
  • which observations are missing P/E
  • what you should ask the data team before continuing

(a) In Exercise 4, we used a left_join() to add P/E ratios from pe_data to asx_tidy.

What happened to the rows in asx_tidy that did not have a matching P/E ratio in pe_data? Briefly explain what the join did in those cases.

(b) Run the code below, which uses a different kind of join, called inner_join().

asx_with_pe_inner <-
    asx_prices_tidy |>
    inner_join(
        pe_data,
        by = join_by(gvkey, fyear)
    )

How is asx_with_pe_inner different from asx_with_pe? What does inner_join() keep that left_join() does not?

(c) Run the code below, which uses another kind of join, called anti_join().

missing_pe <-
    asx_prices_tidy |>
    anti_join(
        pe_data,
        by = join_by(gvkey, fyear)
    )

What does missing_pe represent? How is this different from the dataset created with inner_join() in part (b)?

(d) Compare the number of rows in the following datasets.

nrow(asx_prices_tidy)
nrow(asx_with_pe)
nrow(asx_with_pe_inner)
nrow(missing_pe)

What do you notice? What does this tell you about how much of the firm-year dataset has valid P/E coverage?

(e) Look at the observations in missing_pe.

Use code to investigate whether missing P/E ratios seem to be concentrated in certain years or associated with certain EPS values. Write one or two short findings based on what you see.

(f) The pe_data file was provided to us as a separate dataset.

Why might some firm-year observations in asx_tidy still fail to find a matching P/E ratio in that file, even when price and eps are present?

In your answer, think about possible differences in:

  • which firms were included
  • which years were covered
  • how P/E ratios were defined or filtered before the file was given to us

(g) Suppose you are preparing for a meeting with the data team before using pe in your analysis.

What are two or three specific questions you would ask to check whether the missing P/E values reflect:

  • a normal feature of the data
  • a difference in firm or year coverage
  • a problem in how the file was sourced or matched

If the missing values seem to come from incomplete firm coverage, what might your team do next to source the missing data?