Tutorial 6: Using Variation in Data to Tell a Story
Learning Goals
By the end of this tutorial, you should be able to:
Identify whether a dataset contains cross-sectional, time series, or panel variation
Diagnose whether a dataset is tidy and explain how its structure affects analysis
Tidy and join multiple country-level datasets to build a usable country-year analysis dataset
Formulate and scope a question that can be answered with the available data
Create a clear graph or table that helps communicate a relationship in the data
Write a short, non-technical brief that explains what the data show and why it matters for policy
The Policy Challenge
An international policy think tank is preparing a briefing on how countries change as they develop. Some policymakers are interested in whether rising incomes are closely tied to longer lives. Others want to know whether higher incomes are associated with greater urbanisation and the pressures that come with it, such as demand for housing, transport, and public services.
Your team has been given a set of messy country-level datasets. The data are spread across multiple files and are not yet ready for analysis. Your task is to work out what kind of variation the data contain, turn them into a usable dataset, and produce clear evidence that can support a short policy brief.
R packages for today
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggokabeito)
Prepare these Exercises before Class
Complete these exercises before coming to class. Plan to spend 40 minutes on them.
Exercise 1: Recognising Different Kinds of Data
Data can vary across people, firms, products, places, and time. Before we start working with this week’s datasets, it is useful to step back and think about the kinds of variation different datasets contain.
(a) Look back at the datasets in the table below from earlier in the course. For each one:
classify the dataset as cross-sectional, time series, or panel
describe what one row represents
identify the cross-sectional unit where relevant
identify the time unit where relevant
Dataset | Where it appeared in the course | Data type | What does one row represent? | Cross-sectional unit | Time unit
Cookie Cats | Week 2 Lecture | | | |
Dominick Beer Sales | Week 3 Lecture | | | |
Melbourne Housing | Week 4 Tutorial | | | |
ASX 200 Firm Financials | Week 4 Lecture | | | |
(b) In your own words, explain the difference between the following three types of data:
Cross-sectional data
Time series data
Panel data
Exercise 2: Using Variation in Data to Tell a Story
Watch the Gapminder video available here and answer the following questions:
What relationship is the visualisation trying to show?
What makes the plot effective?
What is one thing that could be improved in the plot they use?
Exercise 3: Inspecting the Structure of the Data
In class, you will work with several small datasets based on Gapminder-style country data. Before you start working with them, it is useful to inspect how each file is structured.
(a) Open each of the following files from the data folder:
gdp.csv
urban.csv
pop.csv
country_to_continent.csv
For each dataset, complete the table below.
Dataset | What does one row represent? | Tidy or untidy? | What is the main structural issue?
gdp.csv | | |
urban.csv | | |
pop.csv | | |
country_to_continent.csv | | |
Hint
When describing the main structural issue, be as specific as you can. For example, think about whether a variable is stored in the column names, or whether values that belong in one column have been spread across many columns.
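If you prefer to inspect a file in R rather than by eye, the check described in the hint can be sketched as follows. This is a minimal illustration, not the required approach: gdp_wide is a toy stand-in built by hand, the country column name is an assumption, and the numbers are invented. Only the gdp_pcap_pp_ column prefix comes from this tutorial.

```r
library(tidyverse)

# Toy stand-in for a wide file. The `country` column name and all values
# are invented for illustration; they are not taken from the real files.
gdp_wide <- tibble(
  country = c("A", "B"),
  gdp_pcap_pp_1960 = c(1000, 2000),
  gdp_pcap_pp_1961 = c(1100, 2100)
)

# Column names that end in a 4-digit year suggest that the variable "year"
# is stored in the column names rather than in a column of its own.
year_cols <- names(gdp_wide)[str_detect(names(gdp_wide), "\\d{4}$")]
year_cols

# glimpse() lists every column with its type, which makes the layout easy to see.
glimpse(gdp_wide)
```

If year_cols is not empty, the values of a single variable have probably been spread across many columns.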
(b) Using what you observed in part (a), explain whether gdp.csv, urban.csv, and pop.csv contain cross-sectional variation, time variation, or both.
Then explain how the current layout of the files hides or reveals that structure.
Exercise 4: Proposing a Question
You have now inspected the datasets and thought about the kind of variation they contain.
(a) Write one question that you think could be answered using the datasets in this week’s data folder. Your question should be specific enough that it could be explored using a table or visualisation.
(b) For the question you wrote in part (a), answer the following using dot points:
Is it descriptive, predictive, or causal?
Why does it belong to that category?
Which dataset(s) or variable(s) would you need?
(c) Explain how you would scope the analysis:
Which countries or regions might you include?
Which years might you include?
What kind of variation would you be using: cross-sectional variation, time variation, or both?
Why do these choices make sense for your question?
In-Class Exercises
You will discuss these exercises in class with your peers in small groups and with your tutor. They build on the exercises you prepared above, so you will get the most value from the class if you complete those before coming.
Exercise 5: Framing Your Group’s Analysis
Your tutor will assign your group one of the following questions:
How does life expectancy relate to GDP per capita?
How does urbanisation relate to GDP per capita?
As a group, complete the tasks below before you begin wrangling and analysing the data.
(a) Write down the question your group was allocated.
(b) Using dot points, answer the following:
Is this best treated as a descriptive, predictive, or causal question with the data available in class?
Why?
(c) Using dot points, answer the following:
Which countries or regions will you include?
Which years will you include?
Will you use cross-sectional variation, time variation, or both?
Why is this a sensible scope for today’s class?
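The scoping choices above map onto different filters on a country-year dataset. The sketch below is only an illustration under assumed names: country_year is a toy tibble, and the country, continent, and year column names and all values are invented rather than taken from the class files.

```r
library(tidyverse)

# Toy country-year data; names and values are invented for illustration.
country_year <- tibble(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("X", "X", "Y"), each = 3),
  year      = rep(c(1990L, 2000L, 2010L), times = 3)
)

# Cross-sectional variation: many countries, a single year.
one_year <- country_year |> filter(year == 2010)

# Time variation: a single country, many years.
one_country <- country_year |> filter(country == "A")

# Both: a subset of countries followed over a range of years.
panel_subset <- country_year |>
  filter(continent == "X", between(year, 1990, 2000))
```

Writing the scope as a filter makes it easy to state in your output exactly which countries and years you used.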
Exercise 6: Build Your Analysis Dataset
Work with your group to build the dataset you need to answer your allocated question. You should decide for yourselves how to prepare the data for analysis.
Hint
Depending on your question, you may need to think about some of the following:
which files you need to load
whether any datasets need tidying before they can be analysed
whether any datasets need to be joined together
whether you want to focus on particular countries, regions, or years
whether there are missing values that affect the years or observations you can analyse
what one row in your final analysis dataset should represent
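As a rough sketch of the joining and missing-value checks listed above, the example below combines three toy tables that are already in long form. All table names, column names (country, year, urban_share, continent), and values are assumptions made up for illustration; your own files may use different names.

```r
library(tidyverse)

# Toy long-format tables; names and values are invented for illustration.
gdp_long <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(1960L, 1961L, 1960L, 1961L),
  gdp_pcap_pp = c(1000, 1100, 2000, 2100)
)
urban_long <- tibble(
  country = c("A", "A", "B"),
  year = c(1960L, 1961L, 1960L),
  urban_share = c(30, 31, 55)
)
continents <- tibble(
  country = c("A", "B"),
  continent = c("X", "Y")
)

# Join on the keys that identify one row: country and year for the
# country-year tables, country alone for the continent lookup.
analysis <- gdp_long |>
  left_join(urban_long, by = c("country", "year")) |>
  left_join(continents, by = "country")

# Count missing values in each column to see which observations are affected.
missing_counts <- analysis |>
  summarise(across(everything(), ~ sum(is.na(.x))))
missing_counts
```

left_join() keeps every row of the left-hand table, so a country-year with no match shows up as NA rather than silently disappearing. Whether that is the right join depends on your question.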
Hint: Reshaping year columns
Some of the datasets store the year inside the column names rather than in a separate column. In cases like this, pivot_longer() can help you move from a wide format to a tidy format.
For example, if your GDP dataset has columns such as gdp_pcap_pp_1960, gdp_pcap_pp_1961, and so on, you could reshape it like this:
gdp <- gdp |>
  pivot_longer(
    cols = starts_with("gdp_pcap_pp_"),
    names_to = "year",
    names_pattern = "gdp_pcap_pp_(\\d{4})$",
    values_to = "gdp_pcap_pp"
  )
Here is what each part is doing:
cols = starts_with("gdp_pcap_pp_") selects the year columns you want to reshape
names_to = "year" creates a new column called year
names_pattern = "gdp_pcap_pp_(\\d{4})$" extracts the 4-digit year from the original column names
values_to = "gdp_pcap_pp" puts the GDP values into a single column
After this step, the dataset will be much closer to tidy form, where each row can represent one country-year observation.
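One detail worth knowing is that pivot_longer() stores the extracted year as text by default. The sketch below repeats the reshape on a toy table and adds names_transform to convert year to an integer, then checks that each country-year appears once. The toy data, the country column name, and the choice to convert to integer are assumptions for illustration.

```r
library(tidyverse)

# Toy wide table; the `country` column name and values are invented.
gdp_wide <- tibble(
  country = c("A", "B"),
  gdp_pcap_pp_1960 = c(1000, 2000),
  gdp_pcap_pp_1961 = c(1100, 2100)
)

gdp_long <- gdp_wide |>
  pivot_longer(
    cols = starts_with("gdp_pcap_pp_"),
    names_to = "year",
    names_pattern = "gdp_pcap_pp_(\\d{4})$",
    names_transform = list(year = as.integer), # store year as a number, not text
    values_to = "gdp_pcap_pp"
  )

# Sanity check: any country-year that appears more than once is listed here.
duplicates <- gdp_long |>
  count(country, year) |>
  filter(n > 1)
duplicates
```

A numeric year makes it easier to filter on ranges of years and to join with other tables whose year column is numeric.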
Exercise 7: Producing Evidence that Answers Your Question
Using the dataset your group prepared in Exercise 6, produce one or two graphs or tables that help answer your allocated question.
Your output should match the scope your group chose in Exercise 5.
What should a good output make clear?
Your graph or table should be clear enough that another group could understand:
what variables you are showing
what subset of the data you chose to use
what pattern or relationship the output is meant to highlight
Presentation reminders
If you create a graph, make sure it has informative axis labels and a clear title. If you create a table, make sure it has clear column names and is easy to interpret.
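The reminders above could look like this in ggplot2. This is a sketch on invented data, not a model answer: the toy tibble, its column names (including life_expectancy), and the choice of a log scale for GDP per capita are all assumptions.

```r
library(tidyverse)

# Toy data for a single year; names and values are invented for illustration.
toy <- tibble(
  country = c("A", "B", "C", "D"),
  continent = c("X", "X", "Y", "Y"),
  gdp_pcap_pp = c(1000, 5000, 20000, 40000),
  life_expectancy = c(55, 65, 75, 80)
)

p <- ggplot(toy, aes(x = gdp_pcap_pp, y = life_expectancy, colour = continent)) +
  geom_point(size = 3) +
  scale_x_log10() + # assumption: a log scale spreads out low-income countries
  labs(
    title = "Toy data: life expectancy against GDP per capita",
    x = "GDP per capita (log scale)",
    y = "Life expectancy (years)",
    colour = "Continent"
  )
p
```

Because ggokabeito is loaded at the top of this tutorial, you could also add scale_colour_okabe_ito() to use a colour-blind-friendly palette.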
Exercise 8: Writing Up Your Findings
Use one graph or table from Exercise 7 and write a short executive brief for an international policy think tank.
Your brief should explain your group’s question, summarise the main pattern you found, and explain why it matters.
Instructions
Length: about 400 words
Exhibit: 1 table or figure
Write for a non-technical audience
Use full sentences in dot points
Do not include code
Do not do any new analysis in this section. Use the evidence you already produced.
For writing style, refer to the “Write Like an Amazonian” document on Canvas.
Format
Use the format below:
Executive Summary
Write 3 to 4 sentences summarising:
the question
the main finding
why it matters
Key Insights
Write up to 3 dot points explaining what the data show.
Policy Implications
Write up to 3 dot points explaining why the findings matter for policymakers or policy organisations.
Recommended Actions
Write up to 3 dot points suggesting clear next steps or actions.