
7 Storing and Using Existing Data
By the end of this lecture, you should be able to:
- Identify common data storage formats (CSV, JSON, SQL, Parquet) and explain when each is appropriate for business use.
- Load, explore, and manipulate datasets stored in flat files and nested JSON using R.
- Use unnesting and reshaping techniques to convert nested or redundant data structures into tidy formats.
- Read and query data stored in databases using duckdb and dplyr.
- Evaluate trade-offs between different storage formats in terms of scalability, structure, and analytical convenience.
- Apply best practices for storing, documenting, and sharing data within an organization.
7.1 The Business Challenge
Online marketplaces like eBay host millions of auctions each year, connecting buyers and sellers across countless product categories. Behind the scenes, this generates complex, nested, and high-volume data. For a business analyst, turning this raw data into useful insights—and storing it in a way that’s efficient, accessible, and reusable—is a foundational challenge.
Suppose you’ve just joined the data team at this fast-growing online marketplace. Right now, every team — from marketing to data engineering and customer support — is handling auction data in different ways: some use Excel sheets, others parse logs from an API that stores information in a format called JSON, and data engineers keep things in a database (but no one else knows how to read them!).
This chaos is slowing everyone down. Your first task as a new analyst is to propose a better data storage solution that balances usability, scalability, and efficiency for teams across the business.
But how do you choose the right format? Should you push everyone to use Excel spreadsheets? Why not convince everyone to move everything into a database?
Before you decide, you’ll need to understand the strengths and trade-offs of each major data storage format. That is what we will cover in this chapter.
About the Data
To guide your decision, we’ll work with a sample dataset of online auctions. Each auction record includes:
- Item metadata: ID, title, category, seller info
- Seller details: user ID, rating
- Bidding history: a nested list of bids, each with a bidder ID, amount, and timestamp
We have stored and structured the dataset in multiple formats — Excel, flat files (CSV), JSON and DuckDB — so you can compare how different formats impact usability and performance. Along the way, you’ll learn not just how to load data, but how to think about storing and transforming it for long-term use in a business setting.
Which of the data formats that we will cover in this chapter have you already heard of?
- Excel
- CSV files
- A database
- JSON
Where did you learn about the formats you are aware of? What limitations have you found when using them?
7.2 Spreadsheets (Excel & Friends)
For many of us, spreadsheets like Microsoft Excel or Google Sheets were the default way we worked with data before starting our business analytics journey.
Reading Spreadsheets in R

The figure above shows what the spreadsheet we’re going to read into R looks like in Excel. We see that there is data across three separate sheets:
- auctions contains information about each auction;
- bids contains information about bids in each auction; and
- sellers contains information about the seller of each item.
To load the spreadsheet data into R, we use the readxl
package, which allows us to read .xlsx files directly without requiring Excel to be installed.
library(readxl)

# Read a spreadsheet file
auctions_excel <- read_excel("data/ebay_auctions.xlsx")

# Peek at the first few rows
head(auctions_excel)

# A tibble: 6 × 8
  item_id  title category start_time          end_time            seller_user_id
  <chr>    <chr> <chr>    <dttm>              <dttm>              <chr>
1 AUC00000 Sams… Electro… 2025-07-19 00:00:00 2025-07-20 00:00:00 pray
2 AUC00001 Casi… Fashion  2025-04-24 00:00:00 2025-04-27 00:00:00 vgomez
3 AUC00002 Loui… Fashion  2025-03-20 00:00:00 2025-03-22 00:00:00 rodrigueznich…
4 AUC00003 The … Books    2025-03-24 00:00:00 2025-03-26 00:00:00 tylercook
5 AUC00004 Mono… Toys & … 2025-08-14 00:00:00 2025-08-15 00:00:00 benjamin55
6 AUC00005 Dyso… Home & … 2025-06-25 00:00:00 2025-06-28 00:00:00 khuff
# ℹ 2 more variables: final_price <dbl>, winner_id <chr>
We can see that by default, R loads the auctions
sheet which is the first sheet in the Excel workbook. If we want to specify the sheet we want to load, we can do that as follows:
# Read a specific sheet
excel_bids <- read_excel("data/ebay_auctions.xlsx", sheet = "bids")

# Or use sheet number
excel_bids <- read_excel("data/ebay_auctions.xlsx", sheet = 2)

head(excel_bids)

# A tibble: 6 × 4
  item_id  bidder_id      bid_amount bid_time
  <chr>    <chr>               <dbl> <dttm>
1 AUC00000 istevens             94.6 2025-07-19 00:00:00
2 AUC00000 mendozajasmine       97.7 2025-07-19 00:00:00
3 AUC00000 james22             103.  2025-07-19 00:00:00
4 AUC00000 wnunez              119.  2025-07-19 00:00:00
5 AUC00000 paynejoshua         124.  2025-07-19 00:00:00
6 AUC00000 yolanda40           127.  2025-07-19 00:00:00
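As a small aside, readxl can also read just part of a sheet. Here is a quick sketch that loads only the first ten rows of the bids sheet; the n_max argument is part of read_excel(), while the object name excel_bids_preview is simply our illustrative choice:

# A sketch: read only the first 10 rows of the "bids" sheet
excel_bids_preview <- read_excel(
  "data/ebay_auctions.xlsx",
  sheet = "bids",
  n_max = 10
)

head(excel_bids_preview)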
Issues with Spreadsheets
Spreadsheets are widely used in business and education, and offer an approachable interface for organizing tables, applying formulas, and making quick charts. This makes spreadsheets a great entry point into the world of data—but also a format with serious limitations when it comes to analytics at scale.
Spreadsheets don’t record your steps, meaning you can’t always retrace how a certain number or figure was calculated. There’s no built-in version history (at least not one that’s easy to audit), and formulas often vary invisibly from cell to cell—especially if you’ve copied or dragged them across a range. This makes it difficult to ensure consistency and nearly impossible to reproduce or validate your results.
Even small errors—like accidentally referencing the wrong column—can go unnoticed. A single cell with a broken formula can distort an entire analysis. These issues are especially problematic when figures or charts are being used for decisions or presentations.
To make matters worse, analysts often inherit spreadsheets from others—colleagues, clients, or external partners. These files might contain:
- inconsistent naming conventions
- manual totals
- hidden formatting or comments
- merged cells
- duplicated values across tabs
all of which make using them for further analysis difficult.
Despite these flaws, spreadsheets remain a powerful tool for quick exploration and lightweight collaboration. They’re useful for small teams, early prototyping, or downloading data from platforms like Shopify, Facebook Ads, or Google Analytics. But when your work moves toward reproducibility, scalability, or automation, it’s time to adopt formats designed for analysis.
Writing to Excel
Even though we may not want to use Excel files as our main working format, we might want to export final results to an Excel workbook to pass on to a less technical colleague. To do this, we use the writexl library to save datasets in Excel format. Let's create a small data frame that we can then write out to a file. Specifically, let's filter all bids from the auction with the ID AUC00001 and save them in a file named small_bids.xlsx:
library(tidyverse)
library(writexl)
colnames(excel_bids)
bids_to_export <-
  excel_bids |>
  filter(item_id == "AUC00001")

write_xlsx(bids_to_export, "small_bids.xlsx")
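As an aside, write_xlsx() can also write several sheets at once if you pass it a named list of data frames, with the list names becoming the sheet names. A small sketch (the output file name small_export.xlsx is our choice):

# Write two data frames to one workbook; list names become sheet names
write_xlsx(
  list(auctions = auctions_excel, bids = bids_to_export),
  "small_export.xlsx"
)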
Google Sheets is another widely used spreadsheet program. It’s free and web-based. Just like with Excel, in Google Sheets data are organized in worksheets (also called sheets) inside of spreadsheet files.
You can load data from a Google Sheet with the googlesheets4 package. This package is not part of the core tidyverse, so you need to load it explicitly.
The first argument to read_sheet() is the URL of the file to read, for example https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w, and it returns a tibble. These URLs are not pleasant to work with, so you'll often want to identify a sheet by its ID instead.
library(googlesheets4)
gs4_deauth()
sheet_id <- "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
google_df <- read_sheet(sheet_id)
You can also write from R to Google Sheets with write_sheet()
. The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:
write_sheet(bids_to_export, ss = "bids_auc10001")
If you'd like to write your data to a specific worksheet inside a Google Sheet, you can specify that with the sheet argument as well.
write_sheet(bids_to_export, ss = "bids_auc10001", sheet = "bids")
While you can read from a public Google Sheet without authenticating with your Google account and with gs4_deauth()
, reading a private sheet or writing to a sheet requires authentication so that googlesheets4
can view and manage your Google Sheets.
When you attempt to read in a sheet that requires authentication, googlesheets4 will direct you to a web browser with a prompt to sign in to your Google account and grant permission to operate on your behalf with Google Sheets. However, if you want to specify a specific Google account, authentication scope, etc. you can do so with gs4_auth()
, e.g., gs4_auth(email = "mine@example.com")
, which will force the use of a token associated with a specific email. For further authentication details, we recommend reading the googlesheets4 auth vignette: https://googlesheets4.tidyverse.org/articles/auth.html.
7.3 Flat Files (CSV, TSV and other delimited files)
When we need to store and share structured data reliably, especially in a way that can be reproduced by code, an alternative format often works better: flat files. Flat files look similar to spreadsheets on the surface—just rows and columns of data—but under the hood they behave very differently. A flat file is stored as plain text. Each line represents a row of data, and the values are separated by a special character like a comma or tab. There are no formulas, no styling, no hidden or merged cells - just the raw information. There is another difference compared to spreadsheets like Excel: flat files store each dataset as a separate file, so there is no concept of multiple sheets of information living in the same file.
One of the most common flat file types is the CSV, which stands for Comma-Separated Values. You have already encountered files with a .csv extension in earlier chapters of this book: all the data we have worked with so far has been stored as CSV files.
For our auction example this means there are 3 CSV files, one each for auctions, bids, and sellers.
Because CSVs simply store raw information, we can see the content of the file when we open it in a text editor. Here's what the first few lines of bids.csv look like:
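The file itself is not reproduced here, but based on the bid records shown elsewhere in this chapter, the start of the file looks roughly like this (the exact timestamps and decimal places in your copy may differ):

item_id,bidder_id,bid_amount,bid_time
AUC00000,istevens,94.61,2025-07-19T10:41:31
AUC00000,mendozajasmine,97.73,2025-07-19T13:27:31
AUC00000,james22,103.28,2025-07-19T16:10:31
AUC00000,wnunez,119.04,2025-07-19T16:34:31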
We can view and read the data with our own eyes, so flat files are easy to inspect manually. The first row, commonly called the header row, gives the column names, and the rows that follow provide the data. The columns are separated, aka delimited, by commas. Flat files are also easy to edit: we can manually add or change rows and save the results if desired. These are not their only advantages: flat files are supported by nearly every data analytics tool, which makes them extremely portable and a common default choice when exporting data from software such as R.
Not all files can be viewed as cleanly as flat file formats such as CSV in a text editor. If we tried to do the same thing and view the first few lines of ebay_auctions.xlsx in a text editor, we would see a jumble of unreadable characters, because .xlsx files are stored in a compressed binary format rather than as plain text.
That’s not so useful!
7.3.1 Reading Flat Files in R
In R, we can use the readr
package to import flat files quickly and reliably. We can read this file into R using read_csv()
. The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called bids.csv
and it lives in the data
folder.
library(readr)
# Read a CSV file into a tibble
bids_flat <- read_csv("data/bids.csv")
Rows: 13414 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): item_id, bidder_id
dbl (1): bid_amount
dttm (1): bid_time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When you run read_csv()
, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message.
The character that separates columns in a flat file is called a delimiter. Common ones include:
- Comma (,) — for .csv files
- Tab (\t) — for .tsv files
- Pipe (|) — occasionally used for exports from databases
If we wanted to load the file bids.tsv, which uses tab-separated values:
head ../data/bids.tsv
We would use the read_delim()
function and specify that columns are tab separated:
bids_tab <- read_delim("data/bids.tsv", delim = "\t")
7.3.2 Writing to a file
readr
also comes with functions for writing data back to disk: write_csv()
and write_delim()
. The most important arguments to these functions are the data frame we want to save and the location where we want to save the file. Suppose we want to take the bids from AUC00001 and save them in a CSV file bids_auc00001.csv located in the data directory:
auctions_filtered <-
  bids_flat |>
  filter(item_id == "AUC00001")

write_csv(auctions_filtered, "data/bids_auc00001.csv")
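write_delim() works in the same way, we just supply the delimiter. For instance, a sketch that saves the same bids as a tab-separated file (the file name here is our choice):

write_delim(auctions_filtered, "data/bids_auc00001.tsv", delim = "\t")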
7.4 JSON Files: When rectangular data might not be enough
Flat files are great when every observation fits neatly into one row. But sometimes our dataset comes with a hierarchy: mini-datasets nested inside a larger dataset. In our bids data, for example, rather than each row being one bid from one auction, with the item_id stored in each row alongside the bid information, we could store the set of bids for each auction as a mini-dataset attached to each item ID.
To handle these cases, we can use nested data formats — and one of the most popular is JSON. JSON (JavaScript Object Notation) is a format designed to store structured data. It can be especially useful when one record contains multiple layers of information—for example, an auction that has several bids attached to it as we have here.
Unlike flat files, JSON allows us to store all related data in one object, without repeating fields. It’s human-readable, flexible, and widely used in real-world business data systems.
JSON stores data as a set of key-value pairs. Think of it like a list of labeled boxes — each label is the name of a variable, and inside each box is the value. In the context of our auction dataset, one auction might look like this in JSON format:
[
{
"item_id": "AUC00000",
"title": "Samsung Galaxy S22",
"category": "Electronics",
"start_time": "2025-07-19T08:04:31.226153",
"end_time": "2025-07-20T08:04:31.226153",
"seller": {
"user_id": "pray",
"rating": 4.63,
"country": "AE"
},
"bids": [
{
"bidder_id": "istevens",
"amount": 94.61,
"time": "2025-07-19T10:41:31.226153"
},
{
"bidder_id": "mendozajasmine",
"amount": 97.73,
"time": "2025-07-19T13:27:31.226153"
},
{
"bidder_id": "james22",
"amount": 103.28,
"time": "2025-07-19T16:10:31.226153"
},
{
"bidder_id": "wnunez",
"amount": 119.04,
"time": "2025-07-19T16:34:31.226153"
}
],
"final_price": 119.04,
"winner_id": "wnunez"
}
]
Let’s break it down:
- The top-level keys (item_id, title, category) describe the auction itself.
- The seller key contains a nested object with seller details.
- The bids key holds a list of bid objects, each with its own bidder, amount, and time.
This structure keeps all related information grouped together. That’s (one of) the power(s) of JSON: it stores complex relationships without flattening the data.
Just because JSON data can feature nested structures does not mean that it has to. JSON data can have a flat structure. An example of a flat JSON dataset that stores key-value pairs for two auctions is:
[
{
"item_id": "AUC10001",
"title": "Nike Air Force 1",
"category": "Fashion",
"start_time": "2025-08-15T09:00:00",
"end_time": "2025-08-17T09:00:00",
"seller_user_id": "alvarez",
"seller_rating": 4.87,
"seller_country": "US",
"final_price": 145.50,
"winner_id": "sophiaw"
},
{
"item_id": "AUC10002",
"title": "LEGO Star Wars Millennium Falcon",
"category": "Toys & Games",
"start_time": "2025-08-20T14:30:00",
"end_time": "2025-08-22T14:30:00",
"seller_user_id": "zhangwei",
"seller_rating": 4.95,
"seller_country": "CN",
"final_price": 349.99,
"winner_id": "marklee"
}
]
Here the data is just a flat list of records, where each attribute sits side-by-side in a similar manner to a row of a flat file.
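Because there is no nesting to deal with, a flat JSON array like this can be read straight into a regular data frame. A minimal sketch, assuming the two records above were saved in a (hypothetical) file data/flat_auctions.json:

library(jsonlite)

# fromJSON() simplifies a flat array of objects into a data frame
flat_auctions <- fromJSON("data/flat_auctions.json")
head(flat_auctions)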
Working with Nested JSON Auction Data in R
Next, we’ll learn how to work with nested JSON data. Each auction record includes top-level information (like item title and category), a nested list of seller information, and a nested list of bids placed by users. Our goal is to explore how to load and manipulate this nested structure using our tidyverse tools. We’ll move step by step, from the raw JSON to a tidy dataset.
Step 1: Load the JSON File
We’ll start by reading in the data using the read_json()
function from the jsonlite
package. This preserves the nested list structure exactly as it appears in the JSON file.
library(jsonlite)
auctions_nested <- read_json("data/ebay_auctions.json")
typeof(auctions_nested)
The object auctions_nested
is a list of auction records.
Step 2: Peek Inside a Single Record
Let's take a closer look at the structure of the first auction record. If we inspect the first element of the list, we can see the keys associated with the auction:
names(auctions_nested[[1]])
We can look at one of the top-level keys, such as title, which stores the name of the product being auctioned:
auctions_nested[[1]]$title
And if we wanted to peek into one of the keys that contains a nested object, such as bids, we can see the names of the keys inside the first bid:
names(auctions_nested[[1]]$bids[[1]])
And if we wanted to see the amount bid:
auctions_nested[[1]]$bids[[1]]$amount
Thus, what we see here is that when we load the data into R using read_json(), the hierarchical structure is preserved.
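As a quick aside, base R's str() is a handy way to get a compact overview of this nesting; the max.level argument controls how many levels are printed:

# Show the structure of the first record, two levels deep
str(auctions_nested[[1]], max.level = 2)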
Step 3: Wrap the List into a Tibble
To start working with this in tidyverse pipelines, we’ll wrap the list of auctions into a tibble with one column:
auctions_tbl <- tibble(record = auctions_nested)
Each row of auctions_tbl
now contains one full auction record:
head(auctions_tbl)
and there are 1000 rows of auction data:
nrow(auctions_tbl)
Step 4: Rectangling Non-Rectangular Data
Now our task is to engage in “data rectangling”: taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. To start rectangling the data, we want to take the top-level keys from our JSON data and convert them to columns. We make this conversion by unnesting the lists stored in the record column and making our data wider, so the function we want to use is unnest_wider(). This process of removing hierarchy is also known as flattening the data. Let's do it:
auctions_tbl <-
  auctions_tbl |>
  unnest_wider(record)

glimpse(auctions_tbl)
Now we have columns like item_id, title, category, seller, and bids, and our data looks more rectangular than it did prior to this unnesting step. However, both seller and bids are still lists, so we need to engage in further flattening.
When each row has the same number of elements with the same names, as with seller, it's natural to widen the data further and put each component into its own column with a second application of unnest_wider(). When we do, the new columns will have a two-part name, seller_<component>, because we set names_sep = "_" when unnesting the list from the seller column.
auctions_tbl <-
  auctions_tbl |>
  unnest_wider(seller, names_sep = "_")

glimpse(auctions_tbl)
Now seller information like seller_user_id and seller_rating is available in separate columns.
Step 5a: Flattening the Bid Lists to Columns
What about the bids column? Each auction has its own list of bids, and each bid is a small list too. A natural next step, given what we have done above, would be to break it out into wide format:
auctions_tbl_wide <-
  auctions_tbl |>
  unnest_wider(bids, names_sep = "_")

glimpse(auctions_tbl_wide)
This gives us one column for each bid: bids_1
, bids_2
, etc. But these are still lists. We start to unnest them as well:
auctions_tbl_wide <-
  auctions_tbl_wide |>
  unnest_wider(bids_1, names_sep = "_", names_repair = "unique") |>
  unnest_wider(bids_2, names_sep = "_", names_repair = "unique")
This gets tedious if there are many bids (and tough if we didn’t know the number of bids per auction). This might suggest that continuing to widen the data is not the ‘right’ approach. As we’ve advocated throughout the book, when we have to manually repeat actions, the structure of the data we are working with, or trying to create, is likely not the most natural one.
Suppose you really wanted to keep widening the data via unnest_wider(), even though the approach might not be first-best. How would we do it? We'd need to follow these steps:
- Determine the names of all the bids_ columns, ideally in an automated way that R can do for us.
- Apply a function that takes all of these column names in the auctions_tbl_wide data and applies unnest_wider() to them.
Here's how you can do that. It uses the reduce() function in Step 2 to iteratively apply a function that unnests each of the columns we identified in Step 1.
# Step 1: Find all column names that
# we want to apply unnest_wider() to
bid_cols <-
  auctions_tbl_wide |>
  select(starts_with("bids_")) |>
  select(where(is.list)) |>
  names()

# Step 2: Iteratively apply unnest_wider() to each of those columns
auctions_tbl_wide <-
  reduce(
    bid_cols,
    .init = auctions_tbl_wide,
    .f = function(df, col) {
      unnest_wider(df,
        !!sym(col),
        names_sep = "_",
        names_repair = "unique"
      )
    }
  )

glimpse(auctions_tbl_wide)
Don’t worry, we won’t expect you to remember those steps. That’s a pretty advanced pipeline. But it shows this route can be pursued if need be.
Step 5b: Unnesting bids to New Rows
So far, our rectangling has made the data wider. But rectangles have two dimensions, width and length. Thus, instead of going wider, we can go longer. This means we unnest the bids data into new rows, with each new row being one bid, via the unnest_longer()
function. The beauty in this is that we do not need to know the number of new rows we need to create per auction:
auctions_tbl_long <-
  auctions_tbl |>
  unnest_longer(bids)

glimpse(auctions_tbl_long)
Now we have a tidy table where each row is one bid and includes the item ID and other metadata. However, the bids column is still a list, but now each element simply contains the details of one bid. We can unnest the bids list wider to get each of the bid's characteristics as new columns:
auctions_tbl_long <-
  auctions_tbl_long |>
  unnest_wider(bids, names_sep = "_")

glimpse(auctions_tbl_long)
7.5 Databases with DuckDB
A huge amount of business data lives in databases, so it's important that you know how to access it. The reason so much business data lives inside databases is that databases store data efficiently, and there is a widely used language, SQL (Structured Query Language), for performing data wrangling on them. Luckily for us, our knowledge of R, and the tidyverse in particular, means we have most of the tools we need to work with data that lives in pre-existing databases. This is because the packages within the tidyverse know how to translate and run dplyr commands on a database. This means that once we know how to connect to a database, we can apply the knowledge from previous chapters to extract insights with relative ease.
Databases are run by database management systems (DBMS’s for short), which come in three basic forms:
- Client-server DBMS’s run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.
- Cloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.
- In-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.
Connecting to a database
To connect to a database from R, you’ll use a pair of packages:
- DBI as a database interface
- duckdb as the database engine that stores and queries data in the database
Whenever you work with a database in R, you’ll always use DBI
(database interface) because it provides a set of generic functions that connect to the database, upload data, and run queries. duckdb
is a high-performance database that's designed very much for the needs of a data scientist. We use it here because it's very easy to get started with, but it's also capable of handling gigabytes of data with great speed. Depending on the specifics of the database you are looking to connect to, you may need to use a different engine. However, the only difference between using duckdb and any other DBMS is how you'll connect to the database. This makes it great to teach with, because you can easily run this code as well as easily take what you learn and apply it elsewhere.
The database we want to connect to contains the auctions data and is located in the file data/auctions.duckdb. We want to use duckdb as the engine with which we interface with the data:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb(), dbdir = "data/auctions.duckdb")

print(con)
Great. Now we have a connection called con
. As we move forward, we’ll always have to use this connection to interact with the database. For example, if we want to see which tables of data are stored in our database, we can use the dbListTables()
function:
dbListTables(con)
Which shows we have three distinct tables of data stored. This is kind of like what we had with our Excel spreadsheet: multiple sheets of data living inside one location. If we then want to look at the contents of a table, we can ask R to return a glimpse of it:
tbl(con, "bids") |>
glimpse()
In the code above, we first had to use tbl()
to create an object that represents a database table. Whenever we want to work with a table in the database, we will need to use the tbl()
function with the connection con
as its first argument. You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you information about each of the columns, it typically doesn't know the exact number of rows. This object is lazy: finding the total number of rows of a table stored in a database (or of the result of a set of data-wrangling tasks) could be (very) time consuming. Thus, by default, we only see a small preview of the data — a handful of rows along with the column definitions — while postponing the actual work of counting or retrieving everything until we explicitly ask for it.
When you have finished working with your database, you should close the connection with dbDisconnect(con). We won't do that just yet, since we still have more querying to do; we'll close the connection once we've finished working with the auctions data below.
dbplyr: dplyr for databases
Now that we've connected to a database and loaded up some data, we can start to wrangle the data. Data wrangling with databases in the tidyverse uses a package called dbplyr. dbplyr is a dplyr backend, which means that we keep writing dplyr code, but behind the scenes it is executed differently. In our case, dbplyr translates the code to SQL - the language for operating on structured databases.
Let's now complete a task that is all too familiar in this chapter: selecting all bids from the auction labelled AUC00001. The dplyr code that would do that for us is:
bids_filtered <-
  tbl(con, "bids") |>
  filter(item_id == "AUC00001")

glimpse(bids_filtered)
Hmmmm, nothing gets returned here. This comes back to the laziness of the tbl()
command. If we want our code to be executed against the database, we have to add a collect()
command as a final step:
bids_filtered <-
  tbl(con, "bids") |>
  filter(item_id == "AUC00001") |>
  collect()

glimpse(bids_filtered)
This matches what we had in previous sections, as expected.
Behind the scenes, dplyr generates the SQL code that is equivalent to what we wrote, then calls dbGetQuery()
to run the query against the database, then turns the result into a tibble.
To see this, we can use show_query()
to see the SQL version of our dplyr
code:
query <-
  tbl(con, "bids") |>
  filter(item_id == "AUC00001")

show_query(query)
and then run that SQL query against the database:
dbGetQuery(con, "SELECT * FROM bids WHERE (item_id = 'AUC00001')")
which yields exactly the same dataset.
Typically, you'll use dbplyr to select the data you want from the database, performing basic filtering and aggregation directly in the database. Then, once you're ready to analyse the data with functions that are unique to R, you'll collect() the data to get an in-memory tibble, and (possibly) continue your work with pure R code.
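As a small illustration (a sketch, not part of our main pipeline), we could count the bids and compute the average bid for each auction inside the database, and only then collect the much smaller summary table into R; the object and column names here are our own choices:

bids_summary <-
  tbl(con, "bids") |>
  group_by(item_id) |>
  summarise(
    n_bids  = n(),
    avg_bid = mean(bid_amount, na.rm = TRUE)
  ) |>
  collect()

head(bids_summary)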
To see this in action, let’s wrangle the data in the database to produce a table very similar to the one we produced by flattening the JSON data.
7.5.1 Working with the Auctions database
To start to aggregate our data into one table, we can join the data in the auctions
table to the seller information in sellers
. We will do this using a left_join()
. First, let's look at each dataset's structure to find a common variable to join on:
tbl(con, "auctions") |>
glimpse()
tbl(con, "sellers") |>
glimpse()
So we can use seller_user_id
as the common variable to join on:
auctions_sellers <-
  tbl(con, "auctions") |>
  left_join(tbl(con, "sellers"),
            by = "seller_user_id")
Notice how we do not use a collect()
statement, because we do not yet need to return the results. We only need the results once we join this data to the bids data. To join auctions_sellers
to the bids data, we use the item_id
as the common key:
res <-
  auctions_sellers |>
  left_join(tbl(con, "bids"),
            by = "item_id") |>
  collect()

glimpse(res)
Which has the same rows and columns as the rectangled data from the JSON file.
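Now that everything we need has been collected into an in-memory tibble, this is a sensible point to close the database connection, as recommended earlier:

dbDisconnect(con)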
7.6 Wrap-up: Which Format Would You Use?
Now that we’ve seen four alternative data storage formats:
- Spreadsheets
- Flat Files
- JSON
- Databases
which one do you think is best for the online auction data we have here?
In the next chapter, rather than working with existing data we are going to explore one method of collecting new data called webscraping. We’ll look to harvest data from websites and then store the information for future analysis.