By the end of this lecture, you should be able to:
Distinguish structured, semi-structured, and unstructured data
Explain intuitively the differences between a user interface (UI) and an application program interface (API)
List the advantages of APIs as a means of collecting data, and use APIs to acquire data
12.1 The Business Challenge
Data is often called the “new oil” of the digital economy. Yet, just as not all oil flows neatly through pipelines, not all data comes in ready-to-use formats. Some data is nicely organized in databases, spreadsheets, or CSV files, while other data appears in messy, irregular, or unconventional formats.
So far in this subject, we have mainly worked with structured data. In practice, however, such well-organized data is something of a luxury. It is costly to collect, standardize, and store, and organizations often invest heavily in information systems just to maintain it. At the same time, vast amounts of potentially valuable data exist outside these traditional formats.
In this chapter, we will explore structured, semi-structured, and unstructured data. You will learn the defining characteristics of each category, why they matter for business analytics, and how analysts can collect and make use of data from different sources.
12.2 Structured, Semi-structured, and Unstructured Data
Just as crude oil must be refined before it becomes useful, raw data only generates value when it can be systematically collected, processed, and analyzed. Depending on how well-defined its format is, data is typically classified into three broad categories:
Structured data
Unstructured data
Semi-structured data
Structured Data
Structured data, as the name suggests, refers to highly organized data that follows a predefined schema. It is stored in formats such as relational databases, spreadsheets, and CSV files, where every piece of data is clearly defined and fits into a particular row and column. Most data that we have worked with so far in this subject belongs to this category.
A well-defined structure for data often consists of:
Rows (Records, Observations): Each row corresponds to one unique instance of the entity being described (e.g., a single customer, a single sales transaction).
Columns (Fields, Variables): Columns represent attributes or characteristics of the entity (e.g., customer age, transaction amount).
Tables: Data is organized into tables that can be linked to each other through shared variables or keys (e.g., linking different customer data using customer ID).
Because of their consistency, structured datasets can be easily managed and analyzed with tools like SQL, R’s Tidyverse, or Python’s Pandas. These tools let us filter, group, aggregate, merge, and reshape structured data efficiently.
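To make these operations concrete, here is a minimal sketch in R. The two small tables and all of their values are invented for illustration; only the dplyr verbs themselves are standard.

```r
library(dplyr)
library(tibble)

# Hypothetical transaction records, invented for illustration
transactions <- tibble(
  customer_id = c(1, 1, 2, 3, 3, 3),
  amount      = c(20, 35, 50, 10, 15, 5)
)

# Hypothetical customer table, linked to transactions through customer_id
customers <- tibble(
  customer_id = c(1, 2, 3),
  segment     = c("retail", "business", "retail")
)

spend_by_segment <- transactions |>
  filter(amount >= 10) |>                       # filtering
  inner_join(customers, by = "customer_id") |>  # merging on a shared key
  group_by(segment) |>                          # grouping
  summarise(                                    # aggregating
    total_spend    = sum(amount),
    n_transactions = n()
  )

spend_by_segment
```

Every step works because each value sits in a known row and column. This is exactly what less structured data does not give us for free.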
Examples
Banking transaction records (date, amount, sender, recipient)
Inventory databases (product ID, stock quantity, unit price)
Employee HR records (employee ID, salary, department)
Structured data’s strength lies in its precision and reliability. However, it is also limited in scope and volume. Much real-world data and information, such as consumer opinions, video content, or machine logs, cannot be easily captured in neat rows and columns.
Unstructured Data
In reality, structured data represents only a small fraction of all data. International Data Corporation (IDC) and Seagate predicted that the global volume of data would continue to grow exponentially and reach 163 zettabytes by the end of 2025 (one zettabyte is 1 trillion gigabytes). Of all this data, more than 80% is unstructured.
Unstructured data, in stark contrast to structured data, lacks a fixed or consistent schema. It is generated in diverse formats and is often messy, irregular, or context-dependent. Broadly speaking, all information that we generate, unless it comes with predefined structures, belongs to unstructured data. Unlike structured data, it cannot be easily represented in tables of rows and columns.
Examples
Textual data: emails, chat messages, social media posts, product reviews.
Multimedia data: images, audio, videos, surveillance camera recordings, readings/images from medical devices.
Unstructured data is pervasive because much of the information we generate in daily life does not naturally follow a rigid, tabular structure. It poses significant challenges in collecting and processing data:
Lack of standardization.
Unstructured data comes in very different categories. Even within the same category, unstructured data may be recorded in different formats.
Computers and programs cannot easily handle data without standardized formats and patterns.
Volume and variety
The scale of unstructured data is massive: billions of social media posts per day, countless images and videos uploaded online, etc.
The variety of formats requires specialized tools for each type (e.g., natural language processing for text, computer vision for images).
Collection challenges
Unstructured data is often dispersed across platforms and devices. Collecting it may involve:
Massive manual work (e.g., human labelling of data, CAPTCHA)
Web scraping (can be technically challenging and sometimes legally restricted)
Large storage infrastructures (extremely costly)
Processing requirements
Before analysis, unstructured data usually must undergo heavy preprocessing and transformation, often with the help of machine learning algorithms and AI:
Text: tokenization, stop-word removal, stemming, sentiment analysis (note that language models such as Google’s BERT and GPT have made processing text and natural language much easier)
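As a rough illustration of the first two text steps, the sketch below tokenizes one sentence and removes stop words using base R only. The review text and the three-word stop list are made up for this example; real projects use much longer stop-word lists and dedicated text-mining packages.

```r
# Hypothetical product review, invented for illustration
review <- "The delivery was fast and the packaging was great"

# Tokenization: lower-case the text and split it on anything that is not a letter
tokens <- strsplit(tolower(review), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Stop-word removal: drop very common words that carry little meaning
stop_words <- c("the", "was", "and")  # tiny illustrative list
content_tokens <- tokens[!tokens %in% stop_words]

content_tokens
```

Only after steps like these does free text start to resemble something we can count, group, and compare.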
Semi-structured Data
Between these two extremes is semi-structured data. Unlike structured data, semi-structured data does not fit neatly into tables, but it still carries markers or tags that provide some structure. Compared with purely unstructured data, semi-structured data has a flexible but interpretable format.
Examples
JSON files (as we discussed in the previous week).
HTML documents: HTML tags (<div>, <h1>, <p>) give structure, but the actual text and images inside the tags are unstructured.
Emails: contain structured fields (sender, recipient, timestamp, subject) but also unstructured free-form text in the body.
Chat logs: often include metadata (user ID, time) plus free text content.
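The chat-log case can be sketched in a few lines of R. The two messages below are invented, and the field names (user_id, time, text) are assumptions chosen for illustration rather than any real platform’s format.

```r
library(jsonlite)

# Hypothetical chat log: structured metadata plus free-form text
chat_json <- '[
  {"user_id": 17, "time": "2025-03-01T09:15:00Z",
   "text": "Hi, my order has not arrived yet."},
  {"user_id": 42, "time": "2025-03-01T09:16:30Z",
   "text": "Sorry to hear that! Could you share the order number?"}
]'

# The keys give enough structure for jsonlite to build a data frame
chat_log <- fromJSON(chat_json)

# user_id and time are ready for analysis; text is still unstructured
str(chat_log)
```

The keys tell the program where each piece of information lives, but the content of the text field would still need text processing before analysis.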
Summary
Structured data is easy to store, query, and analyze, but costly to maintain and often limited in scope.
Unstructured data is abundant and rich but difficult to process without specialized tools.
Semi-structured data sits in the middle: it provides some structure but still requires significant preprocessing before analysis.
Remark
For structured data, we have databases to manage it and relatively standardized data wrangling tools (e.g., dplyr) to work with it. However, we do not have such luxury for less structured data. The ways of collecting and using less structured data usually vary from case to case. Do not panic if the methods that we use in our examples do not work in a different case. What matters is to get familiar with the workflow.
12.3 Getting Data Using Application Program Interface (API)
In a library, we may not always be able to go and look for the books directly, especially when some books are stored in the staff-only area or are difficult to find. In such cases, we can submit a request to a librarian, and the librarian gets the books from the shelves for us.
This process is a nice analogy to what happens when we acquire data via application program interface (API):
You are a data user
The librarian is the API
The staff-only area is the internal system or database you don’t have access to
The request for a book is the message your program sends to the API
The book the librarian brings back is the data or service the API provides
Definitions of API
Technically speaking, an application program interface (API) is a “connection” between computers or between computer programs: it specifies a standard set of rules for how one program can request data or services from another. This is in contrast to a user interface (UI), through which human users interact directly with a computer or program.
Example: Uber’s Use of the Google Maps API
We use the web-based Google Maps or the Google Maps app to search for locations and navigate to places we want to go. In such scenarios, we, as human users, directly interact with the user interface (UI) of Google Maps to acquire data and information.
As a giant in ride-share and door-to-door delivery services, Uber uses Google Maps to obtain the data and information that it needs for its apps (e.g., maps, address information, routes, estimated time of arrival, etc.). In contrast to our usage of Google Maps, Uber gets access to Google Maps data by using programs to interact with the Google Maps application program interface (API). This allows Uber to retrieve data from Google Maps and use it in real time. It is hard to imagine that Uber could have kick-started its business without the Google Maps API.
Of course, access to Google Maps via the API is not free: Uber pays Google millions of dollars to use its services.
Advantages of APIs
Nowadays, many data providers set up APIs as a preferred way to provide data to data users, and there are several advantages of doing so:
Controlled Access. APIs allow data providers to share only selected data or functions, rather than giving users full access to internal systems and data servers. This helps protect sensitive data and system integrity.
Standardization. By providing a standardized way to access data, APIs reduce the need to handle custom requests or build separate solutions for each user, thus reducing the costs.
Innovation and Ecosystem Growth. APIs allow third-party developers to build new apps or services that enhance the value of data (e.g., Google Maps used in ride-sharing apps).
Usage Tracking and Monetization. APIs make it easy to monitor who is using the data and how often. This opens up possibilities for charging based on usage or offering premium tiers.
Workflow of Using APIs to Acquire Data
Most APIs are user-friendly because the rules and protocols of data acquisition have already been set up by data providers. Although APIs may differ significantly from one another in many ways, a standard workflow still applies to their usage.
We are going to use the “Music to Scrape” website (https://music-to-scrape.org/) to walk through the usage of APIs.
Read User Manuals
APIs are structured differently for different functions and needs. In practice, every API may have its own predefined functions and specific requirements for inputs and outputs. Therefore, it is important to read the API’s user manual carefully before you start using it. Data providers often also provide sample programs for their APIs. As a user, it is worth going through these materials (if available).
Obtaining API Key
In many cases, access to an API is restricted: you need to get an API key before you can use it. We can think of an API key as the log-in credentials for a program. Data providers use API keys to track the download and usage of data and to prevent unauthorized access to their data.
For the simple website we are working on, there is no need to obtain an API key. In the workshop, we are going to apply for a free API key from the Federal Reserve Economic Data (FRED).
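To give a feel for how a key is typically used, here is a minimal sketch that builds (but does not send) a FRED request URL. The environment variable name FRED_API_KEY, the placeholder value, and the choice of the GDP series are assumptions made for illustration; the workshop will cover the actual steps.

```r
library(httr)

# Read the key from an environment variable instead of typing it into the script.
# "FRED_API_KEY" is an assumed variable name; the fallback is only a placeholder.
api_key <- Sys.getenv("FRED_API_KEY", unset = "YOUR_KEY_HERE")

# Build the request URL: the key travels as one of the query parameters
fred_url <- modify_url(
  "https://api.stlouisfed.org",
  path  = "fred/series/observations",
  query = list(series_id = "GDP", api_key = api_key, file_type = "json")
)
```

Keeping the key out of the script means we can share our code without sharing our credentials.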
Using APIs
12.4 Setup
Every day we interact with systems that exchange data automatically: Spotify, banking apps, weather services, maps, financial platforms, and AI systems.
Most of these systems communicate through APIs. Websites are usually designed for humans who want to gather information; APIs are designed for software systems that gather information. In other words, APIs are standardized machine interfaces for structured data exchange.
APIs usually return structured transport data, not analysis-ready tables. After we have called an API, we need to do the following to the response we receive:
parse it as a JSON response;
nest it as an R list;
rectangle it as an R table;
clean it as a tidy table;
store it as a database table.
Before we start working with today’s API, we need to get set up:
library(httr)
library(jsonlite)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
api_url <- "https://api.music-to-scrape.org"
httr helps us communicate with APIs.
We store the base URL as an object because every request we make will start from it, with an extension added depending on the specific data we are after.
12.5 Request weekly top tracks
Let’s actually ask a server for some data. To do so, we make an API request:
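A request of this kind could be sketched as follows. The endpoint path "charts/top-tracks" and the helper name request_top_tracks() are assumptions made for illustration, so check the API documentation for the exact endpoint.

```r
library(httr)

api_url <- "https://api.music-to-scrape.org"

# Assumed endpoint path for the weekly top tracks; confirm it in the API docs
top_tracks_url <- modify_url(api_url, path = "charts/top-tracks")

# Hypothetical helper: send a GET request and stop with an error if the
# server reports a problem. Calling it requires an internet connection.
request_top_tracks <- function(url = top_tracks_url) {
  response <- GET(url)
  stop_for_status(response)
  response
}
```

Calling response <- request_top_tracks() and then printing response would show a summary of what the server sent back.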
We get more than the data we are after - we also get metadata, which is information about our request and the data it returns (such as the server, content type, and encoding).
The following may be a helpful way to think about APIs: the response object is like an envelope, while the body is the letter inside. The metadata are information written on the envelope.
But we don’t yet have a dataset - we have a response object from the server that we need to operate on.
12.6 Inspect the JSON response
We need to first extract the body of this response as text - i.e., the JSON text that contains the music data. For our purposes, we can leave the rest behind because we don’t need the metadata, headers, status code, etc.
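One way to do this with httr is sketched below. The helper name body_as_text() is made up for illustration; content() with as = "text" is the httr function that does the work.

```r
library(httr)

# Hypothetical helper: keep only the body of a response, as one JSON string.
# The status code, headers and other metadata are left behind.
body_as_text <- function(response) {
  content(response, as = "text", encoding = "UTF-8")
}
```

Given a response object from an earlier request, json_text <- body_as_text(response) would give us the JSON text to inspect.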
Let’s now inspect the start of this text to ensure we have raw JSON:
substr(json_text, 1, 500)
[1] "{\"unix_start\":1777593600,\"unix_end\":1779076832,\"chart\":[{\"name\":\"Un vestido y un amor (en vivo)\",\"artist\":\"Andres Calamaro\",\"plays\":23},{\"name\":\"Ruby Don't Take Your Love To Town\",\"artist\":\"Ed Bruce\",\"plays\":22},{\"name\":\"Orbit\",\"artist\":\"Megadrums\",\"plays\":21},{\"name\":\"Intro\",\"artist\":\"Styles P\",\"plays\":21},{\"name\":\"On Independence Day\",\"artist\":\"Anti-Flag\",\"plays\":21}]}"
This is just like the JSON file we used last week: braces, brackets, keys, values, and nesting. The main difference is that our text reads from left to right, rather than top to bottom.
But this is not yet a table that we can analyse. We ultimately need to move from this hierarchical structure to a rectangular one.
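The first step on that path is to parse the JSON text. The sketch below does this with jsonlite, using a hand-typed copy of the response body shown above so that it runs without an internet connection. Setting simplifyVector = FALSE is a choice made here to keep the nesting visible as plain R lists.

```r
library(jsonlite)

# Copy of the response body shown above (note the escaped apostrophe)
json_text <- '{"unix_start":1777593600,"unix_end":1779076832,"chart":[{"name":"Un vestido y un amor (en vivo)","artist":"Andres Calamaro","plays":23},{"name":"Ruby Don\'t Take Your Love To Town","artist":"Ed Bruce","plays":22},{"name":"Orbit","artist":"Megadrums","plays":21},{"name":"Intro","artist":"Styles P","plays":21},{"name":"On Independence Day","artist":"Anti-Flag","plays":21}]}'

# Parse the text; simplifyVector = FALSE keeps every level as a list
# instead of letting jsonlite simplify arrays into vectors or data frames
top_tracks_list <- fromJSON(json_text, simplifyVector = FALSE)
```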
Here we have taken the raw JSON text and turned it into a nested R list, which we should inspect:
typeof(top_tracks_list)
[1] "list"
We’ve confirmed our parsed JSON is a nested list in R. Let’s then look at the top-level components of this object to better understand what we have:
names(top_tracks_list)
[1] "unix_start" "unix_end" "chart"
str(top_tracks_list, max.level = 2)
List of 3
$ unix_start: int 1777593600
$ unix_end : int 1779076832
$ chart :List of 5
..$ :List of 3
..$ :List of 3
..$ :List of 3
..$ :List of 3
..$ :List of 3
chart is a list of five lists, where each of these lists is a track. Let’s look at the first of these:
top_tracks_list$chart[[1]]
$name
[1] "Un vestido y un amor (en vivo)"
$artist
[1] "Andres Calamaro"
$plays
[1] 23
Now we are drilling into the actual records returned by the API. One track is itself a structured object - it contains multiple pieces of information that we need to pull out and place into columns.
12.7 Rectangle the top tracks
Before we can unnest our data, we need to wrap it in a tibble. If we skip this, our R functions for unnesting can’t operate on the data:
unnest_wider() takes the named elements inside each nested record and expands them into columns:
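These two steps might look as follows. To keep the sketch self-contained, it uses a two-track stand-in for top_tracks_list$chart, and the list-column name track is an arbitrary choice.

```r
library(tibble)
library(tidyr)

# Stand-in for top_tracks_list$chart: a list with one named list per track
chart <- list(
  list(name = "Un vestido y un amor (en vivo)", artist = "Andres Calamaro", plays = 23L),
  list(name = "Orbit", artist = "Megadrums", plays = 21L)
)

# Step 1: wrap the records in a tibble with a single list-column
top_tracks_nested <- tibble(track = chart)

# Step 2: spread the named elements of each record into their own columns
top_tracks_raw <- top_tracks_nested |>
  unnest_wider(track)
```

With the full chart from the API in place of the stand-in, the same two steps give one row per track.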
head(top_tracks_raw)
# A tibble: 5 × 3
name artist plays
<chr> <chr> <int>
1 Un vestido y un amor (en vivo) Andres Calamaro 23
2 Ruby Don't Take Your Love To Town Ed Bruce 22
3 Orbit Megadrums 21
4 Intro Styles P 21
5 On Independence Day Anti-Flag 21
This looks familiar - we now have rectangular data: familiar columns, familiar rows, and familiar values, all where we want them to be so that we can analyze them.
Let’s do some light analysis to see how regular tidyverse workflows now work naturally:
top_tracks_raw |> count(artist, sort = TRUE)
# A tibble: 5 × 2
artist n
<chr> <int>
1 Andres Calamaro 1
2 Anti-Flag 1
3 Ed Bruce 1
4 Megadrums 1
5 Styles P 1
top_tracks_raw |> slice_max(order_by = plays)
# A tibble: 1 × 3
name artist plays
<chr> <chr> <int>
1 Un vestido y un amor (en vivo) Andres Calamaro 23
This is a big deal: we have transformed hierarchical transport-oriented data into rectangular analytical data.
12.8 APIs as interactive programmable data systems
Up to now, you’ve made one request, inspected one response, parsed one JSON structure, and rectangled one dataset. But to really see the power of APIs, you need to observe how the request itself is flexible and programmable.
We can use different endpoints to return different kinds of data. We can use different parameter values to change the ‘window’ of data we return. All while still using the same workflow pattern that we employed above.
To see this, let’s first vary the query parameters. Before, we looked at top tracks for one weekly window. Now, let’s ask for a window starting in an earlier year:
# A tibble: 5 × 2
name plays
<chr> <int>
1 SNOWPATROL 65
2 Stevie Ray Vaughan 57
3 The Jackson Southernaires 50
4 Mario Rosenstock 49
5 Muse 44
A different endpoint has returned a different data set - but our workflow remained stable.
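A request that changes both the endpoint and the time window could be built along the following lines. The endpoint path "charts/top-artists", the parameter names unix_start and limit, and the start date are all assumptions made for illustration, so confirm them in the API documentation before use.

```r
library(httr)

api_url <- "https://api.music-to-scrape.org"

# The API describes its time window in Unix time (seconds since 1970-01-01 UTC)
window_start <- as.integer(as.POSIXct("2024-01-01", tz = "UTC"))

# Assumed endpoint and parameter names; only the URL is built here, nothing is sent
top_artists_url <- modify_url(
  api_url,
  path  = "charts/top-artists",
  query = list(unix_start = window_start, limit = 5)
)

top_artists_url
```

Passing this URL to GET() and repeating the earlier steps (extract the body, parse the JSON, rectangle the records) would reuse the workflow unchanged.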
12.9 Where to from here?
Today we looked under the hood of how modern systems exchange structured data. The key lesson was not this specific music API, but the broader workflow:
API request → response object → JSON text → R list → rectangular table → examination.
We saw that APIs typically return structured transport data rather than analysis-ready datasets, which is why analysts must inspect the response, parse JSON into R objects, locate the relevant records, and rectangle nested structures into tabular form.
We also saw that APIs are designed to be flexible and reusable. Endpoints define the type of resource we want, while query parameters refine the request and change the returned data. The important skill is therefore not memorizing functions or endpoints, but learning how to inspect unfamiliar structured data and progressively transform it into analytical form.
This is exactly the workflow we will use in the FRED tutorial when working with live macroeconomic data.