# Load packages for data wrangling and visualization
library(dplyr)
library(tibble)
library(readr)     # read_csv()
library(tidyr)     # pivot_longer()
library(purrr)     # imap_dfr()
library(lubridate)
library(ggplot2)
# quantmod is used to get stock price data from Yahoo Finance
library(quantmod)
9 Variation in Cross-Sectional, Time-Series, and Panel Data
By the end of this lecture, you should be able to:
- Differentiate cross-sectional, time-series, and panel data in business and finance contexts.
- Recognize how stock return and financial statement data exemplify each type of data.
- Apply data wrangling tools for date and time.
- Understand basic concepts of stock prices and returns and calculation of relevant metrics.
- Construct trading portfolios and evaluate the performance of investment portfolios.
9.1 The Business Challenge
Financial markets generate an enormous amount of data every second: prices, trading volumes, financial reports, press releases, news headlines, analyst reports, and even tweets. Behind every movement in the stock market lies information waiting to be uncovered.
In today’s financial markets, data-driven investment and trading have become mainstream. Quantitative trading (“quant trading”) is built on the notion that market behavior can be modeled, measured, and predicted using data. Unlike trading based on human judgment, quant trading relies on mathematical models and evidence from data analysis to trade financial assets, such as stocks, currencies, or commodities, rather than on pure intuition or gut feeling.
Instead of asking:
“Do I think this stock will go up tomorrow?”
Quant traders think about:
“What does the data tell me about the likelihood this stock will go up tomorrow?”
In this chapter, we will step into the role of a quantitative trader, analyzing stock prices and experimenting with simple trading strategies. To prepare for this role, we first need to understand how data describes variation across entities and across time. Next, we will work on stock price data to get familiar with handling dates and time stamps effectively. Finally, we will build a mini trading strategy on panel data using past stock price performance.
Before diving into stock prices and financials, it’s essential to understand the type of data we are working with. Different types of data allow us to answer different questions:
Cross-sectional data tells us about differences across entities at one point in time.
Time-series data helps us observe how something evolves over time.
Panel data combines both, letting us see how multiple entities change across time.
When dealing with data in the financial market, recognizing the structure of your data is crucial for selecting the right models and strategies.
9.2 Cross-sectional Data
Cross-sectional data consists of observations collected at a single point in time across multiple subjects or entities (e.g., firms, stocks, households, individuals, cities, countries). This type of data provides a “snapshot” of a population at that moment, showing patterns and differences among various entities.
You can consider cross-sectional data as taking a “photo” or “snapshot.” Imagine pausing the market at a time point, and writing down the stock price and return of every constituent firm in the S&P 500. That dataset would let us compare firms against each other at that moment: Which companies outperformed, which lagged behind, and whether certain firm characteristics were linked to higher or lower returns at the time point.
10 min
In this exercise, we are going to get a sense of cross-sectional data by looking at stock trading data of S&P 500 firms as of 31/12/2024:
The Standard and Poor’s 500, or simply the S&P 500, is a stock market index tracking the stock performance of 500 leading companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices and includes approximately 80% of the total market capitalization of U.S. public companies, with an aggregate market cap of more than $57.401 trillion as of August 29, 2025. (Source: Wikipedia)
Note that the list of firms belonging to the S&P 500 index is not constant. Firms may be added to or dropped from the list due to changes in market capitalization (i.e., the total value of their shares) or mergers and acquisitions (M&A). In addition, the index may include more than 500 firms because some component companies have two share classes.
- Load the data, sp500_xs.csv, which captures stock prices of S&P 500 firms on 31/12/2024. Take a glimpse of the data. In the dataset, prc is the stock price, ret refers to the daily stock return, shrout stands for the number of shares outstanding, and vol is the trading volume (the number of shares traded during that day).
# Read data into R
sp500_xs <- YOUR_CODE("../data/sp500_xs.csv")

# Take a glimpse
YOUR_CODE
- Calculate total market capitalization (i.e., value of all shares, calculated as stock price * number of shares). Which are the top 5 companies in terms of total market cap? Do you know these firms?
top5 <- sp500_xs %>%
  YOUR_CODE(mktcap = YOUR_CODE) %>%
  arrange(YOUR_CODE) %>%
  YOUR_CODE(5)

top5
Solution
# Read data into R
sp500_xs <- read_csv("../data/sp500_xs.csv")
Rows: 503 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): tic
dbl (5): stock_id, prc, ret, shrout, vol
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Take a glimpse
sp500_xs %>% glimpse()
Rows: 503
Columns: 6
$ stock_id <dbl> 10104, 10107, 10138, 10145, 10516, 10696, 11308, 11403, 11404…
$ tic <chr> "ORCL", "MSFT", "TROW", "HON", "ADM", "FI", "KO", "CDNS", "ED…
$ prc <dbl> 111.70, 397.58, 108.45, 202.26, 55.58, 141.87, 59.49, 288.46,…
$ ret <dbl> 0.063265, 0.057281, 0.007057, -0.035525, -0.230407, 0.067977,…
$ shrout <dbl> 2748922, 7430436, 223938, 652182, 533381, 600186, 4323414, 27…
$ vol <dbl> 1711372, 5256086, 319141, 645959, 1805810, 533327, 2938936, 3…
top5 <- sp500_xs %>%
  mutate(mktcap = prc * shrout) %>%
  arrange(desc(mktcap)) %>%
  head(5)

top5
# A tibble: 5 × 7
stock_id tic prc ret shrout vol mktcap
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10107 MSFT 398. 0.0573 7430436 5256086 2954192671.
2 14593 AAPL 184. -0.0422 15441881 11805597 2847482702.
3 84788 AMZN 155. 0.0215 10387381 9505252 1612121531.
4 86580 NVDA 615. 0.242 2464000 9656674 1516025329.
5 13407 META 390. 0.102 2200049 3428921 858327139.
9.3 Time-Series Data
Time-series data is a sequence of observations of a single subject or entity collected over time at successive (usually equally spaced) intervals. This type of data provides a trajectory for a single subject or entity over time.
Time-series data varies in an orthogonal dimension compared with cross-sectional data. Instead of taking a snapshot of all stocks on the market, time-series data records the same set of information of a single stock over time. For instance, you may collect daily stock prices of Apple Inc. (ticker: AAPL
) over time and evaluate the ups and downs.
9.3.1 Data Wrangling for Timestamps
Many datasets (especially in finance, business, and analytics) are tied to dates and times (e.g., stock prices, sales transactions, website visits, or macroeconomic indicators).
However, dates and times are often a tricky component in programming:
- They can appear in texts, numbers, or a mix of the two (e.g., December 31, 2024 vs. 12/31/2024)
- The same date/time can be expressed in many different ways depending on convention or users’ habits (e.g., 12/31/2024 vs. 31/12/2024 vs. 2024-12-31; 5:00 PM vs. 17:00)
- Dates and times can be added and subtracted, but time uses a different numeral system (bases of 12, 24, and 60)
- Time zones and daylight saving times
When we work with financial market data, dates and times are extremely important. When we analyze stock prices and financial performance, we want the data ordered by date and time. Sometimes, we even need to merge datasets by dates and date ranges.
lubridate is the R package that offers powerful functions and tools for dealing with time stamps and periods. We will be actively using tools from lubridate throughout this chapter. You may refer to R4DS for a more comprehensive treatment of time stamps and lubridate.
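As a quick taste of what lubridate offers, its parser functions ymd(), dmy(), and mdy() each read a different ordering of the same calendar date, and its accessor functions pull out individual components. A minimal sketch:

```r
library(lubridate)

# The same calendar date written three different ways
ymd("2024-12-31")         # year-month-day
dmy("31/12/2024")         # day/month/year
mdy("December 31, 2024")  # month day, year

# All three parse to the identical Date object
ymd("2024-12-31") == dmy("31/12/2024")  # TRUE

# Extract components once parsed
year(ymd("2024-12-31"))   # 2024
month(ymd("2024-12-31"))  # 12
```

The parser names encode the expected ordering of components, so you rarely need to spell out a format string by hand.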
9.3.2 Obtain Stock Price Data Using quantmod
In R, quantmod is a powerful package that makes it easy to acquire financial market data. getSymbols() is a function in quantmod that uses APIs to fetch financial market data from various sources (e.g., Yahoo Finance).

What is quantmod?

quantmod stands for Quantitative Financial Modelling Framework. It is mainly used for downloading, charting, and modeling financial time-series data such as stock prices, exchange rates, or interest rates.

quantmod offers a wide range of data sources, including Yahoo Finance and Federal Reserve Economic Data (you may try using quantmod to obtain data from FRED).
We are going to use getSymbols() from quantmod to get a sense of time-series data by looking at Apple Inc. (AAPL)’s year-to-date stock prices.

Load the required packages and use getSymbols() to obtain AAPL’s stock prices from 01/01/2025 to now. Check the class of the output. Is the output a tibble?
# Obtain AAPL's stock prices
getSymbols("AAPL",
           src = "yahoo",
           from = as.Date("2025-01-01")
           # to =  # If to = is not specified, it fetches data up to the most recent
)
[1] "AAPL"
head(AAPL)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2025-01-02 248.93 249.10 241.82 243.85 55740700 242.9874
2025-01-03 243.36 244.18 241.89 243.36 40244100 242.4992
2025-01-06 244.31 247.33 243.20 245.00 45045600 244.1333
2025-01-07 242.98 245.55 241.35 242.21 40856000 241.3532
2025-01-08 241.92 243.71 240.05 242.70 37628900 241.8415
2025-01-10 240.01 240.16 233.00 236.85 61710900 236.0122
class(AAPL)
[1] "xts" "zoo"
as.Date()

as.Date() is a base R function; you can use it without loading lubridate. It converts dates stored in strings into the <date> format. Inside the function, you need to specify the original format of the date strings.

By default, as.Date() handles dates in the “yyyy-mm-dd” format. If the original date strings are not in this format (e.g., dd/mm/yyyy), you need to specify format = "%d/%m/%Y" in the function. %d, %m, and %Y are format specifiers that each stand for a specific time component (see R4DS for more details).
xts objects

xts stands for eXtensible Time Series.
- It is a special data structure in R built on top of a matrix or data frame, but with dates (or times) as the row index.
- This means that, instead of rows numbered 1, 2, 3, …, each row is labeled by a date (or datetime).

It is designed and optimized for time-series data like stock prices, exchange rates, or economic indicators, and it makes dealing with dates and times very easy:
- Subset by time (AAPL["2022-01"] gives all January 2022 data)
- Merge or align time series of different data categories and frequencies
- Process time-series data efficiently while saving memory

However, as an object specialized for time-series data, xts may not generalize to other tasks and cannot be directly processed by our tidyverse tools.
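To see the date-indexed behavior without downloading anything, we can build a tiny xts object by hand (the xts package is loaded together with quantmod; the prices below are made up):

```r
library(xts)

# A toy series: five fictitious prices indexed by date
dates  <- as.Date(c("2025-01-02", "2025-01-03", "2025-01-31",
                    "2025-02-03", "2025-02-04"))
toy_px <- xts(c(100, 101, 99, 102, 103), order.by = dates)

# Subset by the date index, not by row numbers
toy_px["2025-01"]                # all January 2025 rows (3 rows)
toy_px["2025-01-03/2025-02-03"]  # an inclusive date range
```

This string-based time subsetting is the main convenience that xts adds on top of a plain matrix.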
We can convert xts objects into tibbles so that we can apply our data wrangling tools.

Convert the output from the xts class to a tibble. Look at date: is it a date variable?
# Convert xts objects to tibble
aapl_tbl <- as_tibble(AAPL, rownames = "date") %>%
  rename_with(~ sub("AAPL\\.", "", .x))  # remove "AAPL." prefix
head(aapl_tbl)
# A tibble: 6 × 7
date Open High Low Close Volume Adjusted
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2025-01-02 249. 249. 242. 244. 55740700 243.
2 2025-01-03 243. 244. 242. 243. 40244100 242.
3 2025-01-06 244. 247. 243. 245 45045600 244.
4 2025-01-07 243. 246. 241. 242. 40856000 241.
5 2025-01-08 242. 244. 240. 243. 37628900 242.
6 2025-01-10 240. 240. 233 237. 61710900 236.
# Check the type
class(aapl_tbl$date)
[1] "character"
# Convert into dates
aapl_tbl <- aapl_tbl %>%
  mutate(date = as.Date(date))
# Check the type again
class(aapl_tbl$date)
[1] "Date"
Visualize the price trends using ggplot
:
ggplot(aapl_tbl, aes(x = date, y = Close)) +
  geom_line(color = "blue") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b-%Y") +
  labs(title = "AAPL Stock Price (2025)",
       x = "Date", y = "Close Price ($)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
It is important that the variable date is converted to the “Date” format. If it is not, ggplot() and other functions in R may not correctly recognize and process dates.
- Try to plot the price trend without converting the date variable
You may have seen stock prices plotted as “candlesticks.” The candlestick chart is a way to visualize stock price movements by showing (1) opening prices, (2) closing prices, (3) the highest prices, and (4) the lowest prices together in one plot. The candlestick chart offers a neat way to convey rich price information.
Let’s use ggplot() to generate a candlestick chart for the period between 08/09/2025 and 12/09/2025:

candle_sticks <- ggplot(aapl_tbl %>%
                          filter(date >= as.Date("2025-09-08") &
                                 date <= as.Date("September 12, 2025", format = "%B %d, %Y")),
                        aes(x = date, open = Open, high = High, low = Low, close = Close)) +
  tidyquant::geom_candlestick() +
  labs(title = "AAPL Candlestick Chart",
       y = "Price", x = "Date")
The candlestick chart is widely used to represent stock price information. For each trading day (or week, month, quarter, …), the chart uses a bar (the body) to represent the opening and closing prices, and thin sticks (the wicks) to show the highest and lowest prices of the day.

In addition, it uses colors to distinguish gains from losses. In the chart below, red bars and sticks represent losses, whereas blue bars and sticks represent gains.
Read the candlestick chart of AAPL between 08/09/2025 and 12/09/2025. Given that Apple Inc. announced its new iPhone, Airpods, and Apple Watches on 10/09/2025, how do you think Apple’s investors perceive the new product?
candle_sticks
9.3.3 From Prices to Returns
A stock (or share) represents a small piece of ownership in a company. If you buy one share of Apple, you literally own a tiny slice of Apple. As a shareholder, you benefit when the company does well because the value of your share can go up, and you may also receive dividends, which are cash payments companies sometimes give to their owners.
Stock prices are simply how much one share costs in the market. For example, if Apple is trading at $234.07 as of 12/09/2025, you would pay $234.07 to buy one share, or receive $234.07 to sell one. Stock prices change constantly throughout the day as investors react to news, company announcements, or broader economic conditions.
When we look at stocks, we usually care less about the exact price levels but more about how that price changes over time. That is where the concept of a return comes in. A stock’s return tells us the percentage gain or loss over a given period. For example, without considering any dividends, if Apple’s share price rises from $234.07 to $250, the return is about 6.8% ((250.00 - 234.07)/234.07). In simple terms, returns show you how much your money has grown (or shrunk) as a percentage, making them directly comparable across investments in different companies and time periods.
Assuming there are no dividends, stock splits, or repurchases, we can use the stock prices at two time points (\(t1\) and \(t2\)) to calculate the return accrued over the period between them:
\(Return_{t1, t2} = (Price_{t2} - Price_{t1})/Price_{t1}\)
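The formula can be checked with the Apple numbers quoted in the previous paragraph ($234.07 rising to $250); the prices come from the text, not live data:

```r
p1 <- 234.07  # price at time t1
p2 <- 250.00  # price at time t2

# Simple return between t1 and t2
simple_return <- (p2 - p1) / p1
round(simple_return, 4)  # 0.0681, i.e., about 6.8%
```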
In practice, investors are interested in returns over fixed time periods, such as daily, monthly, quarterly, and annual returns. We use the difference between the closing price at the end of the period and the closing price right before the period to calculate such returns.
Hence, to calculate the daily return for AAPL on day \(t\), we first calculate the change in closing prices from day \(t-1\) to day \(t\) and then scale it by the closing price of day \(t-1\) (the base period). To do so, we need the lag() operation:

# Calculate returns from prices
aapl_return <- aapl_tbl %>%
  arrange(date) %>%  # Sort in ascending order to avoid mistakes
  transmute(date, Price = Adjusted,
            Return = (Adjusted - lag(Adjusted))/lag(Adjusted))
There are several prices in the data obtained through getSymbols(). Conceptually, we only need Close to calculate daily returns. However, in reality, prices can change for reasons outside the stock market. For example, dividend payouts affect stock prices and the number of shares outstanding, making post-payout prices incomparable to pre-payout prices. Stock splits also make prices incomparable across days.
Here is an example:
Suppose Firm A has 10,000 shares outstanding, and each share is priced at $10 on September 1, 2025. On September 2, the firm announces that it will split each share into two shares. A stock split does not affect the total market cap; it only changes how many pieces the cake is divided into.
- In this case, the total market cap remains unchanged: \(10 \times 10,000 = 100,000\).
- Number of shares outstanding doubled to 20,000.
- Price for each share becomes $5.
Does it mean that investors of Firm A experience a 50% loss overnight?
As a result, we need to adjust for these factors so that prices are comparable over time. In the data, Adjusted contains prices that have been adjusted so that all prices are comparable over time. When calculating returns, we should use Adjusted rather than Close to ensure that prices are comparable and returns are correctly calculated.
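The split example above can be checked in a few lines of base R: raw closing prices suggest a phantom 50% loss, while split-adjusted prices correctly show a 0% return. (The numbers follow the Firm A example; real-world adjustment factors also incorporate dividends.)

```r
# Firm A: a 2-for-1 split between September 1 and September 2
close_raw    <- c(10, 5)  # raw closing prices straddling the split
split_factor <- 2         # each old share becomes two new shares

# Naive return from raw prices: looks like a 50% loss
(close_raw[2] - close_raw[1]) / close_raw[1]  # -0.5

# Adjusted prices: scale the pre-split price by the split factor
close_adj <- c(close_raw[1] / split_factor, close_raw[2])
(close_adj[2] - close_adj[1]) / close_adj[1]  # 0, no actual gain or loss
```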
transmute()

transmute() combines mutate() and select(): we can both select columns and perform operations on columns within transmute().
9.3.4 Cumulative Returns
As an investor, you may be interested in how much money you are making by buying and holding a stock. We can answer this by either calculating price changes or compounding daily returns. Compounding returns is more commonly used in portfolio management because it may be difficult to obtain a “price” for a portfolio. To prepare for analyzing portfolio returns, we are going to get familiar with compounding operations in R:
aapl_cumret <- aapl_return %>%
  mutate(Return = if_else(is.na(Return), 0, Return)) %>%  # Fill missing as 0
  mutate(Cumret = cumprod(1 + coalesce(Return, 0)) - 1)
In the code chunk above, we use cumprod(1 + coalesce(Return, 0)) - 1
to calculate cumulative returns. If we express it in mathematically, we are simply calculating:
\(Cumret_t = (1 + Return_1)\times(1 + Return_2) \times ... \times (1 + Return_{t-1}) \times (1 + Return_t) - 1\)
- Return is the stock return for each day
- coalesce(Return, 0) replaces missing values in Return with 0 (otherwise, everything after a missing value would be missing)
- To compound, we add 1 to each return and then multiply these (1 + return) terms up to a given date
- cumprod() calculates the cumulative product of all daily (1 + return) terms up to a given date
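A tiny numerical example makes the compounding mechanics concrete (the three daily returns are made up):

```r
# Three hypothetical daily returns: +10%, -5%, +2%
r <- c(0.10, -0.05, 0.02)

# Daily growth factors
1 + r             # 1.10 0.95 1.02

# Cumulative growth of $1 up to each day
cumprod(1 + r)    # 1.1000 1.0450 1.0659

# Cumulative return up to each day
cumprod(1 + r) - 1  # 0.1000 0.0450 0.0659
```

Note that the cumulative return after three days (about 6.59%) is not simply the sum of the daily returns (7%), because each day's return compounds on the previous days' gains and losses.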
With Cumret, we can visualize the ups and downs of our wealth. Suppose we invest $100 in AAPL at the beginning of 2025; the plot below shows how that $100 evolves throughout the year:
# Do some transformation to Cumret so that it reflects investment value
ggplot(aapl_cumret, aes(x = date, y = 100 * (Cumret + 1))) +
geom_line() +
scale_x_date(date_breaks = "1 months", date_labels = "%b-%Y") +
labs(title = "Value of Investment in AAPL",
y = "Value", x = "Date") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
For a single stock, looking at prices directly is actually easier than compounding returns as above. However, for an investment portfolio of multiple stocks, the compounding-return approach is much better: it puts all stocks on a common scale (returns) and avoids dealing with the prices of multiple stocks.
9.4 Panel Data
Panel data is a combination of cross-sectional and time-series data. Instead of only looking across firms at one point in time (cross-sectional), or following a single firm across time (time-series), panel data records many firms across many points in time.
Most of the time, the data available to us comes as panel data. Panel data has two (orthogonal) dimensions: entity and time. Below is an example of panel data where rows represent entities and columns represent time points.
panel <- read_csv('../data/panel_example.csv')
Rows: 10 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): id, Year1, Year2, Year3, Year4, Year5
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
panel
# A tibble: 10 × 6
id Year1 Year2 Year3 Year4 Year5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.520 0.600 0.584 0.562 0.554
2 2 0.215 0.252 0.0458 0.626 0.396
3 3 0.0319 0.880 0.363 0.274 0.707
4 4 0.817 0.550 0.852 0.869 0.594
5 5 0.743 0.110 0.895 0.0314 0.205
6 6 0.439 0.554 0.749 0.581 0.0274
7 7 0.748 0.902 0.501 0.612 0.640
8 8 0.372 0.886 0.942 0.819 0.245
9 9 0.482 0.120 0.560 0.492 0.267
10 10 0.542 0.807 0.781 0.296 0.720
In practice, such data is often stored in the “long form” rather than the “wide form” shown above:
# Make the data a long table
panel_long <- panel %>%
  pivot_longer(
    cols = starts_with("Year"),  # columns to reshape
    names_to = "Year",           # new column for year labels
    values_to = "Value"          # new column for values
  )
panel_long
# A tibble: 50 × 3
id Year Value
<dbl> <chr> <dbl>
1 1 Year1 0.520
2 1 Year2 0.600
3 1 Year3 0.584
4 1 Year4 0.562
5 1 Year5 0.554
6 2 Year1 0.215
7 2 Year2 0.252
8 2 Year3 0.0458
9 2 Year4 0.626
10 2 Year5 0.396
# ℹ 40 more rows
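The reshape can be reversed with tidyr's pivot_wider(), which is a handy sanity check that no information was lost. A self-contained sketch on a tiny made-up long table:

```r
library(dplyr)
library(tidyr)

# A tiny long-form panel: two entities, two years
long_demo <- tibble(
  id    = c(1, 1, 2, 2),
  Year  = c("Year1", "Year2", "Year1", "Year2"),
  Value = c(0.52, 0.60, 0.215, 0.252)
)

# pivot_wider() undoes pivot_longer(): year labels become columns again
wide_demo <- long_demo %>%
  pivot_wider(
    names_from  = Year,   # column whose values become new column names
    values_from = Value   # column supplying the cell values
  )

wide_demo  # 2 rows, columns id, Year1, Year2
```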
The strength of panel data is that it lets us study both differences across firms and changes within firms over time in the same framework. This opens the door to much more powerful analysis. For example, we could ask: “Do firms with higher profitability earn higher future returns?” By using panel data, we can track the same firm’s financial ratios and returns over time while also comparing across many firms at once. This dual perspective helps us avoid misleading results that might come from only looking at a single snapshot or only tracking one company.
Panel data is especially important because it allows us to compare one observation with other observations at the same time point, or with its own past. These features help us draw more reliable conclusions and inferences from data.
Panel data also allows us to use fixed effects to isolate the effects of individual entities or of time. Most of us have used the notion of fixed effects, consciously or unconsciously. For example, when we compare an entity with comparable entities at the same time point, we are controlling for time fixed effects. When we compare someone with their own past, we are accounting for individual entity fixed effects.
Fixed effects refer to the average effect at either the entity level or the time level. Understanding the concept of fixed effects may help you avoid fallacies when drawing conclusions.
Time fixed effects
- Adjust for events that affect all entities
- Example: during the 2008 financial crisis or the COVID-19 pandemic, stock prices fell for almost every firm
- We should not attribute the decline of stock prices to firm-specific reasons if this is a market-wide phenomenon
- We can figure out which firms experienced declines of relatively smaller magnitude than others, or compare a firm with the average of all entities at the same time point
Firm Fixed Effects
- Account for entity-level time-invariant effects
- Certain industries are innately more profitable than others (e.g., high tech vs. cafe)
- If we observe that AAPL makes net profit margin of 25.26%, while a local cafe makes 10%, it does not necessarily mean that the local cafe is doing poorly.
- We need to compare “Apple” to “Apple”!
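The two comparisons above can be sketched with dplyr: demeaning by date mimics time fixed effects (compare to the same-day market average), while demeaning by firm mimics firm fixed effects (compare each firm to its own average). The tiny panel below is made up for illustration:

```r
library(dplyr)

# Hypothetical returns: two firms, two dates, with a market-wide crash on day 2
fe_demo <- tibble(
  firm = c("A", "A", "B", "B"),
  date = as.Date(c("2020-03-01", "2020-03-02",
                   "2020-03-01", "2020-03-02")),
  ret  = c(0.01, -0.08, 0.02, -0.06)
)

fe_demo %>%
  group_by(date) %>%
  mutate(ret_vs_market = ret - mean(ret)) %>%   # time FE: relative to same-day average
  group_by(firm) %>%
  mutate(ret_vs_own_avg = ret - mean(ret)) %>%  # firm FE: relative to the firm's own average
  ungroup()
```

On the crash day, both firms fall, but ret_vs_market shows that firm A fell 1 percentage point more than the market average — a firm-specific signal that the raw return alone hides.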
9.4.1 Construct an Investment Portfolio
Do not put all your eggs in one basket!
In practice, quantitative hedge funds and traders rarely put all their money into just one stock. Instead, they spread their investments across many different stocks, creating what we call a portfolio. By looking at these portfolios, we can see two kinds of variation: cross-sectional variation, which tells us how different firms perform on the same day, and time-series variation, which tracks how a single firm’s performance changes from day to day. The job of a quant trader is to analyze both types of variation and find patterns that can guide trading decisions. In this lecture, we are going to step into their shoes and explore real data to see how this works.
We obtained the list of top 5 stocks in terms of market cap in an earlier exercise. Now, we will use getSymbols() to retrieve their stock prices and combine all these data into a panel. Note that you are not required (but highly recommended) to fully understand the block of code below.
# Get the list of tickers
ticker_list <- top5$tic

# Get prices for all stocks on the list
getSymbols(ticker_list, from = "2025-01-01", auto.assign = TRUE)
[1] "MSFT" "AAPL" "AMZN" "NVDA" "META"
# Grab all output objects into a list
prices_list <- mget(ticker_list, inherits = TRUE)
# Convert a list of xts objects into a tibble
all_stocks <- imap_dfr(
  prices_list,
  ~ {
    xt  <- .x
    tic <- .y
    df  <- as_tibble(data.frame(date = index(xt), coredata(xt)))
    # Strip any "AAPL." / "MSFT." etc. prefix from all price columns
    names(df) <- sub("^.*\\.", "", names(df))
    df %>%
      mutate(ticker = tic) %>%
      select(date, ticker, Open, High, Low, Close, Volume, Adjusted)
  }
)
9.4.2 Visualize Returns of Individual Stocks
Now, we are going to visualize the ups and downs of the five stocks in our portfolio since 01/01/2025. As you can see from the data, price levels vary significantly across firms. To make them comparable, a better option is to calculate cumulative returns and make a plot in which all five stocks share the same starting point. To do this, we can replicate what we did for AAPL on the other stocks:
For each stock (i.e., group_by(ticker)), calculate daily returns and then compound returns up to each date:
# Calculate cumulative returns
all_stocks <- all_stocks %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    Return = Adjusted / lag(Adjusted) - 1,
    Cumret = cumprod(1 + coalesce(Return, 0)) - 1
  ) %>%
  ungroup()
Similar to what we have done for AAPL, assume that we invest 100 dollars into each stock, plot the value of investment for each stock since 01/01/2025:
ggplot(all_stocks, aes(x = date, y = 100 * (Cumret + 1), color = ticker)) +
  geom_line(linewidth = 0.75) +
  labs(
    title = "Value of Investment since 2025-01-01",
    x = "Date",
    y = "Value of Investment",
    color = "Ticker"
  ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b-%Y") +
  theme_minimal(base_size = 12)
9.4.3 Calculate Equal-Weighted Portfolio Returns
Quant traders almost always invest in multiple stocks and construct a portfolio to manage all investment targets. When constructing portfolios, a fund can invest in its constituent stocks equally. For example, if you have five stocks in your portfolio, you put 20% of your wealth in each of them. Such portfolios are called equal-weighted portfolios.
In some cases, we do not want to split our money evenly among several stocks. We may want to invest a bit more in certain stocks based on our assessment or statistical analyses. One common practice in quant investment is to construct value-weighted portfolios based on market capitalization. Alternatively, you may place larger weights on Nvidia (NVDA) and Microsoft (MSFT) because you expect that positive (or even overly positive) sentiment about AI may push these stocks' prices even higher.
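Before building either portfolio in the data, it helps to see how value weights are formed: each stock's weight is its market cap divided by the portfolio's total market cap, and the portfolio return is the weighted average of the constituents' returns. The numbers below are made up:

```r
# Hypothetical market caps (in $bn) and one day of returns for three stocks
mktcap <- c(3000, 2000, 1000)
ret    <- c(0.01, -0.02, 0.03)

# Value weights sum to 1; bigger firms get bigger weights
w <- mktcap / sum(mktcap)
w  # 0.5000 0.3333 0.1667

# Value-weighted portfolio return: two equivalent computations
sum(w * ret)                # explicit weighted sum
weighted.mean(ret, mktcap)  # same result via base R's weighted.mean()
```

weighted.mean() normalizes the weights internally, so passing raw market caps gives the same answer as computing the weights by hand.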
Here, we can use dplyr to easily summarize returns at the portfolio level and calculate the returns of such an equal-weighted portfolio:
# Equal-weighted portfolio
pf_ew <- all_stocks %>%
  group_by(date) %>%
  summarise(
    ew_ret = mean(Return, na.rm = TRUE)  # equal weight = simple average
  ) %>%
  arrange(date) %>%
  mutate(
    ew_cumret = cumprod(1 + coalesce(ew_ret, 0)) - 1
  )
# Plot returns of equal-weighted portfolio over time
ggplot(pf_ew, aes(x = date, y = ew_cumret)) +
geom_line(color = "darkblue", linewidth = 1) +
labs(
title = "Equal-Weighted Portfolio Cumulative Return",
x = "Date", y = "Cumulative Return"
+
) scale_x_date(date_breaks = "1 months", date_labels = "%b-%Y") +
theme_minimal(base_size = 12) +
scale_y_continuous(labels = scales::percent)
10 min
Make use of the variables in sp500_xs to construct a value-weighted portfolio:

Assign weights using market capitalization as of 2024-12-31. Note that you do not want to use information or data that was not available at the time you construct a portfolio (i.e., no foresight).
Aggregate the five stocks into a value-weighted portfolio and visualize the return of the portfolio.
Solution
all_stocks_vw <- all_stocks %>%
  left_join(top5 %>% select(ticker = tic, mktcap),
            by = 'ticker')

pf_vw <- all_stocks_vw %>%
  group_by(date) %>%
  summarise(
    vw_ret = weighted.mean(Return, mktcap, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(date) %>%
  mutate(
    vw_cumret = cumprod(1 + coalesce(vw_ret, 0)) - 1
  )
# Plot cumulative returns of both portfolios over time
ggplot() +
  geom_line(data = pf_ew, aes(x = date, y = ew_cumret, color = "Equal-Weighted"), linewidth = 0.75) +
  geom_line(data = pf_vw, aes(x = date, y = vw_cumret, color = "Value-Weighted"), linewidth = 0.75) +
  labs(
    title = "Portfolio Cumulative Return",
    x = "Date", y = "Cumulative Return (%)",
    color = "Portfolio Types"
  ) +
  theme_minimal(base_size = 12) +
  scale_y_continuous(labels = scales::percent)
9.5 Conclusion
In this lecture, you learned how data structure shapes the questions you can ask and the tools you should use:
Cross-sectional data compares entities at one time (e.g., S&P 500 firms on 31/12/2024).
Time-series data tracks one entity over time (e.g., AAPL daily prices in 2025).
Panel data combines both, letting you study differences across firms and changes within firms over time.
Key takeaways
Always identify your unit of observation (entity, time) and understand the variation in your data along different dimensions before proceeding with data analysis.
Data wrangling skills for dates and times
Simple quant trading workflows (more in the workshop)