Tutorial 3: Data Visualisation for Business Intelligence

Learning Goals

By the end of this tutorial, you should be able to:

Explain how different ggplot2 visualization techniques reveal patterns in the Melbourne housing market that would be difficult to see in tables or raw numbers.
Use ggplot2 to create bar plots, histograms, and scatterplots that analyze relationships between property prices, location, and dwelling types.
Apply transformations such as log scales, proportions, and faceting to improve the interpretability of price distributions and spatial trends.
Customize ggplot2 plots using themes, color palettes, and annotations to enhance clarity and storytelling.
Evaluate and compare visualization approaches, identifying which effectively communicate insights about housing price trends and spatial patterns.
Build and refine a visual analysis workflow that could inform practical decision-making for home buyers, investors, or urban planners.

The Business Challenge

The Topic: Understanding Melbourne’s Housing Market

Melbourne’s real estate market is one of the most dynamic in Australia. Property prices are influenced by several factors, including location, dwelling type, and proximity to the Central Business District (CBD). Buyers, sellers, and policymakers constantly analyze housing trends to make informed decisions. Understanding these trends is crucial in a $2 trillion+ housing market, where even small fluctuations in pricing expectations can have significant financial impacts.

A key question we will explore is: How do property prices vary by region, dwelling type, and location? By analyzing real estate data, we can identify patterns that help explain the drivers of Melbourne’s housing prices and their implications for different stakeholders in the market.

The Data: The Melbourne Housing Market Dataset

We will use a dataset containing real estate sales data from Melbourne. The dataset provides insights into housing trends and includes key variables that influence property prices.

The dataset captures the property location within broader metropolitan regions (regionname), allowing us to compare trends across different areas. It also includes the type of dwelling, classified as House (h), Townhouse (t), or Apartment (u), enabling us to assess price differences based on housing type. The sale price (price) provides a measure of market value, while the distance to the CBD (distance) helps analyze the impact of proximity to the city center on property prices.

Similar datasets are widely used by real estate analysts, banks, and property investors to assess housing trends, predict future movements, and guide investment decisions. Through this tutorial, we’ll work with real-world data to uncover patterns and develop a deeper understanding of Melbourne’s property market. Let’s dive in!

Where we’re headed

Just a few lines of R code can turn a table of numbers into a data visualization.

From this:

# A tibble: 10 × 6
   type      address          postcode   price distance regionname    
   <chr>     <chr>               <dbl>   <dbl>    <dbl> <chr>         
 1 House     85 Turner St         3067 1480000      2.5 Northern Metro
 2 House     25 Bloomburg St      3067 1035000      2.5 Northern Metro
 3 House     5 Charles St         3067 1465000      2.5 Northern Metro
 4 House     40 Federation La     3067  850000      2.5 Northern Metro
 5 House     55a Park St          3067 1600000      2.5 Northern Metro
 6 House     129 Charles St       3067  941000      2.5 Northern Metro
 7 House     124 Yarra St         3067 1876000      2.5 Northern Metro
 8 House     98 Charles St        3067 1636000      2.5 Northern Metro
 9 House     217 Langridge St     3067 1000000      2.5 Northern Metro
10 Townhouse 18a Mollison St      3067  745000      2.5 Northern Metro

to this:

Loading the Data

R packages for today

library(tidyverse)  # for plotting, includes ggplot2
library(patchwork)  # for combining multiple plots into subfigures
library(scales)     # for formatting axis scales
library(ggokabeito) # color blind friendly color palette -- this course's default
library(ggthemes)   # some extra styling & themes to explore

Loading the Data in R

housing <-
    read_csv("data/melbourne_housing.csv")
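Before plotting, it can help to confirm that the columns have the names and types you expect. The sketch below is illustrative only: it rebuilds three rows from the preview above as a stand-in tibble called housing_preview (a hypothetical name), so that it runs without the CSV file. With the real data you would call the same functions on housing.

```r
library(tidyverse)

# Stand-in data: three rows copied from the preview table above.
# `housing_preview` is a hypothetical name used only for this sketch.
housing_preview <- tribble(
    ~type,       ~address,          ~postcode, ~price,  ~distance, ~regionname,
    "House",     "85 Turner St",    3067,      1480000, 2.5,       "Northern Metro",
    "House",     "25 Bloomburg St", 3067,      1035000, 2.5,       "Northern Metro",
    "Townhouse", "18a Mollison St", 3067,      745000,  2.5,       "Northern Metro"
)

# Column names and types at a glance
glimpse(housing_preview)

# Number of rows for each dwelling type
type_counts <- count(housing_preview, type)
type_counts
```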

Prepare these Exercises before Class

Complete these exercises before coming to class. Plan to spend about 45 minutes on them.

Switch on the eval flag when you want to evaluate code!

The R code chunks below contain starter code for you to work from. Each chunk has the eval option set to false so that it is not run, because the chunks contain placeholders such as YOUR_VARIABLE_NAME that would generate errors.

Switch the eval value to true when you want the R code within a chunk to be run when you compile your document.
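As a rough illustration, and assuming your document uses Quarto-style chunk options (comment lines starting with `#|` at the top of a chunk), a chunk that has been switched on might look like the sketch below. The plot uses ggplot2's built-in mpg data only because it contains no placeholders; it is not part of the housing exercises, and eval_demo is a hypothetical name.

```r
#| eval: true
# The line above is the chunk option: false skips the chunk, true runs it.
library(ggplot2)

# No placeholders remain, so this code is safe to run when the document compiles
eval_demo <- ggplot(mpg, aes(x = drv)) +
    geom_bar()
eval_demo
```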

Exercise 1: Plotting Number of Listings by Region

By the end of this exercise, your plot should look similar to this one:

(a). If you were a real estate investor or city planner, why would knowing the number of listings by region be valuable? What decisions could be influenced by this visualization?

(b). Use the starter code below to create a first version of the bar plot. Replace YOUR_VARIABLE_NAME with the appropriate variable name from the dataset.

What is a Bar Plot?

A bar plot is a graphical representation of categorical data, where each category is represented by a bar whose height reflects its frequency or proportion. It is commonly used to compare counts across different groups, making it easy to identify the most and least common categories in a dataset. In business and research, bar plots help visualize trends, distributions, and key differences in data at a glance.
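To make the difference between frequency and proportion concrete, here is a minimal sketch. It deliberately uses ggplot2's built-in mpg data rather than the housing data, so the exercise itself is left for you. The object names count_plot and prop_plot are hypothetical.

```r
library(ggplot2)

# Counts: geom_bar() tallies the rows in each category for you
count_plot <- ggplot(mpg, aes(x = class)) +
    geom_bar()

# Proportions: after_stat(prop) rescales the counts to shares;
# group = 1 makes the shares sum to 1 across all bars
prop_plot <- ggplot(mpg, aes(x = class, y = after_stat(prop), group = 1)) +
    geom_bar()

count_plot
prop_plot
```

Both plots have the same shape; only the y-axis changes. Proportions are often easier to compare when the total number of observations is not meaningful to the reader.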

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME)
       ) +
    geom_bar()

(c). Extend the plot above by adding labels to the x-axis, y-axis, and title. Use the labs() function to add these labels. The code below also rotates the axis labels by 25 degrees to improve readability.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME)
       ) +
    geom_bar() +
    labs(x = "YOUR_X_AXIS_LABEL", 
         y = "YOUR_Y_AXIS_LABEL", 
         title = "YOUR_TITLE"
         ) +
    theme(
        axis.text.x = element_text(angle = 25, hjust = 1)
    )

(d). What patterns do you notice in the number of listings? Are certain regions overrepresented or underrepresented? How might this affect housing supply and pricing?

Exercise 2: Visualizing the Distribution of Prices by Dwelling Type

By the end of this exercise, your plot should look similar to this one:

(a). Why is a histogram useful for understanding price distributions? What insights can we gain from plotting price distributions separately for each dwelling type?

What is a Histogram?

A histogram is a visualization that represents the distribution of numerical data by dividing values into bins and counting the number of observations in each bin. Unlike a bar plot, which displays categorical data, a histogram is used for continuous variables like price, age, or distance.

Histograms help identify patterns in data, such as skewness, outliers, and central tendencies, making them a crucial tool for understanding how values are distributed in a dataset.
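The number of bins changes what a histogram shows. The sketch below, which uses ggplot2's built-in diamonds data as a stand-in rather than the housing data, draws the same variable with few and with many bins. The names coarse_hist and fine_hist are hypothetical.

```r
library(ggplot2)

# Few bins: a coarse picture of the overall shape
coarse_hist <- ggplot(diamonds, aes(x = price)) +
    geom_histogram(bins = 10)

# Many bins: finer detail, but also more noise
fine_hist <- ggplot(diamonds, aes(x = price)) +
    geom_histogram(bins = 100)

coarse_hist
fine_hist
```

Every observation falls into exactly one bin in both versions, so the bar heights always add up to the number of rows. Only the level of detail differs.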

(b). Use the starter code below to create a basic histogram of house prices. Replace YOUR_VARIABLE_NAME with the correct variable.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME)
       ) +
    geom_histogram(bins = 50)

(c). Extend the plot above by adding labels to the x-axis, y-axis, and title using the labs() function.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME)
       ) +
    geom_histogram(bins = 50) +
    labs(x = "YOUR_X_AXIS_LABEL", 
         y = "YOUR_Y_AXIS_LABEL", 
         title = "YOUR_TITLE"
         )

(d). Modify the x-axis so that prices are displayed in dollar format by adding scale_x_continuous(labels = scales::dollar). The starter code below also rotates the x-axis labels by 55 degrees to make the numbers easier to read.

ggplot(housing, 
       aes(x = price)
       ) +
    geom_histogram(bins = 50) +
    YOUR_CODE_HERE +
    labs(x = "YOUR_X_AXIS_LABEL", 
         y = "YOUR_Y_AXIS_LABEL", 
         title = "YOUR_TITLE"
         ) +
    theme(
        axis.text.x = element_text(angle = 55, hjust = 1)
    )
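To see what the dollar labeller from part (d) does on its own, you can call it directly on a few numbers. This is a small sketch; the three values are sale prices taken from the data preview earlier in the tutorial, and example_prices and dollar_labels are hypothetical names.

```r
library(scales)

# Sale prices taken from the data preview earlier in the tutorial
example_prices <- c(745000, 1035000, 1480000)

# dollar() converts plain numbers into currency strings
dollar_labels <- dollar(example_prices)
dollar_labels
```

When the function is passed to the labels argument of a scale, ggplot2 applies it to the axis breaks in the same way.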

(e). Add fill = type inside aes() to color the histogram by dwelling type and then create subplots by dwelling type.

ggplot(housing, 
       aes(x = price, 
           YOUR_CODE_HERE
           )
       ) +
    geom_histogram(bins = 50) +
    YOUR_CODE_HERE +
    labs(x = "YOUR_X_AXIS_LABEL", 
         y = "YOUR_Y_AXIS_LABEL", 
         title = "YOUR_TITLE"
         ) +
    theme(
        axis.text.x = element_text(angle = 55, hjust = 1)
    ) +
    YOUR_CODE_HERE
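If you have not split a plot into subplots before, the sketch below shows one possible approach on ggplot2's built-in diamonds data, not the housing data. It assumes facet_wrap() is a suitable tool here; other faceting functions exist, and facet_demo is a hypothetical name.

```r
library(ggplot2)

facet_demo <- ggplot(diamonds, aes(x = price, fill = cut)) +
    geom_histogram(bins = 50) +
    # One panel per level of cut; free y-axes because group sizes differ
    facet_wrap(~ cut, scales = "free_y")

facet_demo
```

Free y-axes make the shape of each distribution easier to see, at the cost of making bar heights harder to compare across panels.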

(f). What does the histogram tell you about price differences between dwelling types?

Exercise 3: The Price-Distance Relationship

By the end of this exercise, your plot should look similar to this one:

(a). Why might we expect a relationship between price and distance to the CBD?

(b). Use the starter code below to build a scatter plot of price vs. distance, replacing YOUR_VARIABLE_NAME with the correct column names.

What is a Scatter Plot?

A scatter plot is used to visualize the relationship between two continuous variables by plotting individual data points. Each dot represents one observation, helping identify patterns, clusters, and trends. In this case, it allows us to explore how house prices change with distance.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME, y = YOUR_VARIABLE_NAME)
       ) +
    geom_point()

(c). Since we have a large number of data points, many overlap, making it difficult to distinguish individual observations. Modify the geom_point() function to reduce over-plotting by adjusting the alpha value. Experiment with different values (e.g., 0.1, 0.05, 0.01) and describe how each affects the visualization.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME, y = YOUR_VARIABLE_NAME)
       ) +
    geom_point(alpha = YOUR_NUMBER)
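The effect of alpha is easiest to see on a dataset with many overlapping points. Here is a minimal sketch using ggplot2's built-in diamonds data (roughly 54,000 rows) as a stand-in for the housing data. The names opaque_plot and transparent_plot are hypothetical.

```r
library(ggplot2)

# Opaque points: dense regions collapse into a solid mass
opaque_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point()

# Semi-transparent points: darker areas now indicate more observations
transparent_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.05)

opaque_plot
transparent_plot
```

Because alpha is set outside aes(), it is a fixed setting applied to every point rather than a mapping from a variable.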

(d). We can better understand the relationship by adding a statistical transformation to the plot. Complete the geom_smooth() function call to overlay a straight line over the data. The se = FALSE argument removes the confidence interval around the line.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME, 
           y = YOUR_VARIABLE_NAME)
       ) +
    geom_point(alpha = YOUR_NUMBER) +
    geom_smooth(method = YOUR_METHOD, 
                se = FALSE)

(e). Finally, add labels to the x-axis, y-axis, and title using the labs() function. Adjust the x- and y-axis scales as you deem appropriate.

ggplot(housing, 
       aes(x = YOUR_VARIABLE_NAME, y = YOUR_VARIABLE_NAME)
       ) +
    geom_point(alpha = YOUR_NUMBER) +
    geom_smooth(method = YOUR_METHOD, 
                se = FALSE) +
    scale_y_continuous(YOUR_CODE) +
    labs(x = "YOUR_LABEL", 
         y = "YOUR_LABEL", 
         title = "YOUR_TITLE"
         )
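One scale adjustment worth trying when a variable is strongly right-skewed is a log scale. The sketch below shows the idea on ggplot2's built-in diamonds data, not the housing data. Whether a log scale suits your plot is a judgment call, and log_demo is a hypothetical name.

```r
library(ggplot2)
library(scales)

log_demo <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.05) +
    # log10 spacing on the y-axis, with labels still shown in dollars
    scale_y_log10(labels = dollar)

log_demo
```

On a log scale, equal vertical distances represent equal percentage changes rather than equal dollar amounts, which spreads out the crowded low-price region.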

(f). What does the scatter plot reveal about the relationship between house prices and distance to the CBD? How does the line help interpret this relationship?

In-Class Exercises

You will discuss these exercises in class with your peers in small groups and with your tutor. They build on the exercises you prepared above, so you will get the most value from the class if you have completed those beforehand.

Exercise 4: Improving the Listings by Region Plot

(a). Together with your peers, propose three changes you want to make to the plot in Exercise 1. Discuss the rationale behind each change and how it might improve the visualization.

(b). Work with your tutor to create an agreed-upon list of changes to make to the plot. What steps do you need to take to make these improvements?

(c). Implement the agreed changes in your R code.

Exercise 5: Improving the Price Distribution Plot

(a). Together with your peers, propose three changes you want to make to the plot in Exercise 2. Discuss the rationale behind each change and how it might improve the visualization.

(b). Work with your tutor to create an agreed-upon list of changes to make to the plot. What steps do you need to take to make these improvements?

(c). Implement the agreed changes in your R code.

Exercise 6: Improving the Price-Distance Relationship Plot

(a). Together with your peers, propose three changes you want to make to the plot in Exercise 3. Discuss the rationale behind each change and how it might improve the visualization.

(b). Work with your tutor to create an agreed-upon list of changes to make to the plot. What steps do you need to take to make these improvements?

(c). Implement the agreed changes in your R code.

Exercise 7: Putting the Plots Together

(a). Use the code below to combine the improved plots from the three exercises above. Replace PLOT_ONE, PLOT_TWO, and PLOT_THREE with the plots you created in Exercises 4, 5, and 6.

PLOT_ONE /    
    (PLOT_TWO | PLOT_THREE) + 
    plot_layout(guides = "collect") & 
    theme(legend.position = "bottom")
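To use the layout above, each plot first needs to be stored in a named object. The sketch below illustrates this with three simple plots built from ggplot2's built-in mpg data. The names top_plot, left_plot, right_plot, and combined_demo are hypothetical.

```r
library(ggplot2)
library(patchwork)

# Store each plot in its own object instead of printing it straight away
top_plot <- ggplot(mpg, aes(x = class)) +
    geom_bar()
left_plot <- ggplot(mpg, aes(x = hwy)) +
    geom_histogram(bins = 20)
right_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point()

# `/` stacks plots vertically, `|` places them side by side
combined_demo <- top_plot / (left_plot | right_plot)
combined_demo
```

The parentheses matter: they group the two lower plots into one row before that row is stacked under the top plot.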

(b). Save this combined plot as a PDF file using the ggsave() function. Specify the file name with the filename argument, ending it in .pdf so that ggsave() writes a PDF, and set the dimensions with the width and height arguments.

ggsave("YOUR_FILE_NAME", width = 12, height = 8)
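By default, ggsave() saves the last plot that was displayed. If you would rather be explicit about which plot is written, you can pass it through the plot argument, as in the sketch below. It uses a simple plot of ggplot2's built-in mpg data and a temporary file path. Both are stand-ins, and save_demo and demo_file are hypothetical names.

```r
library(ggplot2)

save_demo <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point()

# A temporary path, so this sketch does not write into your project folder
demo_file <- tempfile(fileext = ".pdf")

# The .pdf extension selects the PDF device; width and height are in inches
ggsave(filename = demo_file, plot = save_demo, width = 12, height = 8)

file.exists(demo_file)
```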

Exercise 8: Synthesizing the Findings

If you were presenting these results in a business meeting, how would you explain the key takeaways? Structure your summary to include:

What you analyzed (data, variables)
What you found (key insights from the visualizations)
What decisions could be made based on this information?