In Irrational Exuberance, Robert Shiller compiled over 100 years of data to demonstrate how market dynamics create cycles in which stock market prices become disconnected from valuations. One criticism [1] of his work is that prices are set by the market, not by valuations, and that because of this no one can say when the market is “overvalued” or “undervalued.” I intend to use the dataset to explore the link between value and price.
The data is already in tidy format, but still requires some wrangling prior to exploration.
## 'data.frame': 1769 obs. of 11 variables:
## $ Date : num 1871 1871 1871 1871 1871 ...
## $ Price : num 4.44 4.5 4.61 4.74 4.86 4.82 4.73 4.79 4.84 4.59 ...
## $ Dividend : num 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.26 ...
## $ Earnings : num 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 ...
## $ CPI : num 12.5 12.8 13 12.6 12.3 ...
## $ DateFraction: num 1871 1871 1871 1871 1871 ...
## $ GS10 : num 5.32 5.32 5.33 5.33 5.33 5.34 5.34 5.34 5.35 5.35 ...
## $ RealPrice : num 89 87.6 88.4 94.3 99 ...
## $ RealDividend: num 5.21 5.06 4.99 5.17 5.3 5.38 5.38 5.46 5.34 5.25 ...
## $ RealEarnings: num 8.02 7.78 7.67 7.96 8.15 8.27 8.27 8.41 8.21 8.08 ...
## $ CAPE : num NA NA NA NA NA NA NA NA NA NA ...
The str command shows that there are two date fields. Both use a YYYY.MM format, with the first numbering the months 01 through 12 and the second representing months as a fraction of the year. I reviewed the background information [2] on the dataset and the original Excel file. The most logical explanation I could find was that the second format was created for use as the X-axis for the charts in the Excel file. The first format was likely an output of another program or data source and is not a format interpretable by Excel.
I converted the values in the Date column into an R-friendly Date format and dropped the DateFraction column.
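A minimal sketch of how that conversion might look. The helper name `yyyymm_to_date` is hypothetical and the original code is not shown, so treat this as one possible approach. The main pitfall is that a numeric value such as 1871.1 has lost its trailing zero and must be read as October, not January.

```r
# Hypothetical helper: convert Shiller-style YYYY.MM numbers to Date values.
yyyymm_to_date <- function(x) {
  # Format with exactly two decimals so that 1871.1 becomes "1871.10" (October)
  stamp <- sprintf("%.2f", x)
  # Append a day of month so as.Date can parse the string
  as.Date(paste0(stamp, ".01"), format = "%Y.%m.%d")
}

dates <- yyyymm_to_date(c(1871.01, 1871.1, 2018.05))
print(dates)
```

Each value is mapped to the first day of its month, which matches the Date values shown in the later str output.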
The dataset contains the 10-Year Treasury Constant Maturity Rate labeled as GS10. The rates are in percentage form, most likely to make them easier to plot alongside the other features in the original Excel file. I converted them to decimal form to prepare for analysis.
As an additional preparation step I created several new features from the data: Momentum, CPIChange and forward returns over 1, 3, 5 and 10 years.
Inflation was captured out of curiosity about its effect on future returns. Momentum was measured because studies have shown that price advances and declines can persist [3]. I am interested to see whether it will make a good complement to value in predicting returns.
I created a pct_change function to extract the above features from the price and CPI columns. I also created a function to shift the results of pct_change for building the future returns values.
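The two helpers could be sketched roughly as follows. The original definitions are not shown, so the signatures, the name `shift_forward` and the toy price series are assumptions. With monthly data, the lag would be 12 for a one-year change and 36, 60 or 120 for the longer horizons.

```r
# Hypothetical re-creation of the two helpers; the originals are not shown.
pct_change <- function(x, n) {
  # Percent change versus the value n observations earlier; first n values are NA
  stopifnot(n >= 1, n < length(x))
  c(rep(NA_real_, n), tail(x, -n) / head(x, -n) - 1)
}

shift_forward <- function(x, n) {
  # Move each value n positions earlier, so row i holds the change
  # that is only realized n rows later (a forward return)
  stopifnot(n >= 1, n < length(x))
  c(tail(x, -n), rep(NA_real_, n))
}

# Toy price series growing 10% per step (not taken from the dataset)
price <- c(100, 110, 121, 133.1, 146.41)
momentum <- pct_change(price, 1)          # trailing change, NA at the start
fwd_return <- shift_forward(momentum, 1)  # same values, NA at the end
print(momentum)
print(fwd_return)
```

The trailing measure is missing at the start of the series and the forward measure is missing at the end, which is the same NA pattern visible in the str output.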
Furthermore, I created categorical versions of inflation and momentum to help explore the positive and negative conditions of each.
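One way such labels might be built is to split each numeric measure on its sign. The rule below (strictly positive versus everything else) and the helper name `label_by_sign` are assumptions, since the exact thresholds are not stated.

```r
# Assumed rule: the sign of the 12-month change decides the label.
label_by_sign <- function(x, negative, positive) {
  # Values that are not strictly positive get the `negative` label; NA stays NA
  factor(ifelse(x > 0, positive, negative),
         levels = sort(c(negative, positive)))
}

# Toy values, not taken from the dataset
momentum   <- c(NA, -0.06, 0.07, 0.18)
cpi_change <- c(NA, 0.00, 0.02, -0.01)

growth    <- label_by_sign(momentum, "Recession", "Expansion")
inflation <- label_by_sign(cpi_change, "Deflation", "Inflation")
print(table(growth, useNA = "ifany"))
print(table(inflation, useNA = "ifany"))
```

Sorting the levels alphabetically reproduces the level order that appears in the str output ("Expansion", "Recession" and "Deflation", "Inflation").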
## 'data.frame': 1769 obs. of 19 variables:
## $ Date : Date, format: "1871-01-01" "1871-02-01" ...
## $ Price : num 4.44 4.5 4.61 4.74 4.86 4.82 4.73 4.79 4.84 4.59 ...
## $ Dividend : num 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.26 ...
## $ Earnings : num 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 ...
## $ CPI : num 12.5 12.8 13 12.6 12.3 ...
## $ GS10 : num 0.0532 0.0532 0.0533 0.0533 0.0533 0.0534 0.0534 0.0534 0.0535 0.0535 ...
## $ RealPrice : num 89 87.6 88.4 94.3 99 ...
## $ RealDividend : num 5.21 5.06 4.99 5.17 5.3 5.38 5.38 5.46 5.34 5.25 ...
## $ RealEarnings : num 8.02 7.78 7.67 7.96 8.15 8.27 8.27 8.41 8.21 8.08 ...
## $ CAPE : num NA NA NA NA NA NA NA NA NA NA ...
## $ Momentum : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPIChange : num NA NA NA NA NA NA NA NA NA NA ...
## $ Growth : Factor w/ 2 levels "Expansion","Recession": NA NA NA NA NA NA NA NA NA NA ...
## $ Inflation : Factor w/ 2 levels "Deflation","Inflation": NA NA NA NA NA NA NA NA NA NA ...
## $ Fwd1yrReturns : num 0.0946 0.0844 0.0933 0.0928 0.0658 ...
## $ Fwd3yrReturns : num 0.0495 0.0667 0.026 -0.0295 -0.0782 ...
## $ Fwd5yrReturns : num 0.0045 0.00444 -0.02169 -0.08439 -0.13992 ...
## $ Fwd10yrReturns: num 0.394 0.371 0.354 0.312 0.337 ...
## $ Outlook : Factor w/ 2 levels "Bearish","Bullish": 2 2 2 2 2 2 2 2 2 2 ...
## Date Price Dividend Earnings
## Min. :1871-01-01 Min. : 2.73 Min. : 0.180 Min. : 0.160
## 1st Qu.:1907-11-01 1st Qu.: 7.74 1st Qu.: 0.410 1st Qu.: 0.540
## Median :1944-09-01 Median : 16.42 Median : 0.830 Median : 1.325
## Mean :1944-08-31 Mean : 259.74 Mean : 5.637 Mean : 12.810
## 3rd Qu.:1981-07-01 3rd Qu.: 122.90 3rd Qu.: 6.370 3rd Qu.: 13.607
## Max. :2018-05-01 Max. :2789.80 Max. :50.000 Max. :109.880
## NA's :2 NA's :5
## CPI GS10 RealPrice RealDividend
## Min. : 6.28 Min. :0.01500 Min. : 67.67 Min. : 4.99
## 1st Qu.: 10.10 1st Qu.:0.03290 1st Qu.: 170.13 1st Qu.: 8.53
## Median : 18.20 Median :0.03860 Median : 253.23 Median :12.73
## Mean : 57.84 Mean :0.04569 Mean : 509.86 Mean :15.17
## 3rd Qu.: 91.60 3rd Qu.:0.05220 3rd Qu.: 613.87 3rd Qu.:19.27
## Max. :249.98 Max. :0.15320 Max. :2813.54 Max. :50.08
## NA's :2
## RealEarnings CAPE Momentum CPIChange
## Min. : 4.19 Min. : 4.78 Min. :-0.65609 Min. :-0.19700
## 1st Qu.: 12.66 1st Qu.:11.79 1st Qu.:-0.06067 1st Qu.: 0.00000
## Median : 20.45 Median :16.17 Median : 0.06691 Median : 0.02264
## Mean : 29.87 Mean :16.86 Mean : 0.06119 Mean : 0.02236
## 3rd Qu.: 39.34 3rd Qu.:20.47 3rd Qu.: 0.18493 3rd Qu.: 0.04615
## Max. :111.42 Max. :44.20 Max. : 1.24152 Max. : 0.23669
## NA's :5 NA's :120 NA's :12 NA's :12
## Growth Inflation Fwd1yrReturns Fwd3yrReturns
## Expansion:1113 Deflation: 474 Min. :-0.65609 Min. :-0.82409
## Recession: 644 Inflation:1283 1st Qu.:-0.06067 1st Qu.:-0.05511
## NA's : 12 NA's : 12 Median : 0.06691 Median : 0.16223
## Mean : 0.06119 Mean : 0.18797
## 3rd Qu.: 0.18493 3rd Qu.: 0.40225
## Max. : 1.24152 Max. : 1.38523
## NA's :12 NA's :36
## Fwd5yrReturns Fwd10yrReturns Outlook
## Min. :-0.71629 Min. :-0.61661 Bearish: 644
## 1st Qu.:-0.05574 1st Qu.: 0.07159 Bullish:1113
## Median : 0.24275 Median : 0.49322 NA's : 12
## Mean : 0.33082 Mean : 0.72440
## 3rd Qu.: 0.64044 3rd Qu.: 1.29522
## Max. : 2.38378 Max. : 3.65442
## NA's :60 NA's :120
The final dataset contains 1769 observations of 22 variables: the 19 shown above plus three CAGR columns that are added later in the analysis.
Because price is measured over time, it is best plotted as a time series. The scale of the first plot makes it difficult to compare the data starting in the late 1800s to more recent years. The dataset already contains a transformed version in the form of RealPrice, which has been adjusted for inflation. Taking the log10 of the prices makes them even more readable, and it is a natural transformation because price growth is geometric. The final inflation-adjusted and log-transformed graph makes it easy to compare the growth and recessions over the entire history of the dataset.
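To illustrate why the log transform suits geometric growth, here is a small sketch on a synthetic price path (the 5% growth rate is an arbitrary assumption, not a figure from the dataset). Absolute steps keep getting larger, while the log10 steps are constant.

```r
# Synthetic price path growing 5% per period (not taken from the dataset)
periods <- 0:9
price <- 100 * 1.05^periods

abs_steps <- diff(price)          # absolute changes grow every period
log_steps <- diff(log10(price))   # log10 changes are all equal to log10(1.05)

print(round(abs_steps, 2))
print(round(log_steps, 4))
```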
Logically, the dividends and earnings should benefit from the same transformations.
Dividends show the same steady growth over time as price, but the second half of the data appears to have much lower variance than the first half. I wonder if this is due to the decision to start taxing dividends in 1954 [4].
An interesting feature of the earnings data is the pair of significant dips that occurred during the Great Depression of the 1930s and the Great Recession of 2008. Earnings appear to have been hit even harder than stock prices.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03290 0.03860 0.04569 0.05220 0.15320
The 10-Year Treasury rates (GS10) show a huge spike in the late 1970s and early 1980s as the Fed and policy makers sought to curb runaway inflation. This creates a long right tail in the distribution of rates. Transforming the scale reveals multiple peaks in the distribution. The GS10 rate is heavily influenced by the Fed Funds rate. Could these peaks be the default speeds of the Fed for boosting and taming the economy?
CAPE, the Cyclically Adjusted Price to Earnings Ratio, is the heart of the dataset. Robert Shiller created the ratio to smooth for inflation and business cycle effects by dividing the inflation-adjusted price by the 10-year average of inflation-adjusted earnings.
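The dataset already ships with CAPE, but the calculation can be sketched as a ratio of real price to a trailing mean of real earnings. The helper names below are hypothetical, and whether the averaging window includes the current month is an assumption. The toy example uses a 3-observation window so the result can be checked by hand; 120 monthly observations would correspond to the 10-year average.

```r
# Hypothetical helper: trailing mean over `window` observations
# (NA until the window is full).
trailing_mean <- function(x, window) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 1))
}

compute_cape <- function(real_price, real_earnings, window = 120) {
  # Assumption: the window includes the current observation
  real_price / trailing_mean(real_earnings, window)
}

# Toy values, not taken from the dataset
real_price    <- c(90, 100, 110, 120, 130)
real_earnings <- c(8, 10, 12, 10, 8)
cape_toy <- compute_cape(real_price, real_earnings, window = 3)
print(cape_toy)
```

The leading NA values in the result mirror the NA values at the start of the CAPE column.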
Transforming to log scale makes it easier to see an important characteristic. The tails of the distribution represent the cheapest and most expensive readings for the market. However, the linear scale of the x axis hides the fact that changes at the lower end represent proportionately larger percentage differences in price per unit of earnings. For example, the difference between a value of 4 and 5 means paying 25% more for earnings, whereas 40 to 41 is only an increase of 2.5%. The log scale balances the emphasis placed on the rare occurrence of valuations at both ends of the spectrum.
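The arithmetic in that example can be checked directly. On a linear axis both moves cover one unit, while on a log10 axis the move from 4 to 5 covers roughly nine times the distance of the move from 40 to 41.

```r
cape_low  <- c(4, 5)
cape_high <- c(40, 41)

pct_low  <- cape_low[2] / cape_low[1] - 1      # 0.25, i.e. 25% more per unit of earnings
pct_high <- cape_high[2] / cape_high[1] - 1    # 0.025, i.e. 2.5% more

log_gap_low  <- diff(log10(cape_low))    # distance on a log10 axis, about 0.097
log_gap_high <- diff(log10(cape_high))   # about 0.011

print(c(pct_low, pct_high))
print(round(c(log_gap_low, log_gap_high), 3))
```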
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.65609 -0.06067 0.06691 0.06119 0.18493 1.24152 12
The Momentum distribution has long tails, with outlier years returning over 100% and losing more than 50%. The boxplot makes these extreme occurrences easy to see. The mean annual return was 6.12%.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.19700 0.00000 0.02264 0.02236 0.04615 0.23669 12
The distribution of CPIChange shows very high kurtosis, with a mean of 2.24%. This aligns with the Fed’s target, while the tails show just how far inflation and deflation can go when they get out of hand.
The bar charts show that the US has spent most of the period in a state of rising inflation and growth.
## Fwd1yrReturns Fwd3yrReturns Fwd5yrReturns
## Min. :-0.65609 Min. :-0.82409 Min. :-0.71629
## 1st Qu.:-0.06067 1st Qu.:-0.05511 1st Qu.:-0.05574
## Median : 0.06691 Median : 0.16223 Median : 0.24275
## Mean : 0.06119 Mean : 0.18797 Mean : 0.33082
## 3rd Qu.: 0.18493 3rd Qu.: 0.40225 3rd Qu.: 0.64044
## Max. : 1.24152 Max. : 1.38523 Max. : 2.38378
## NA's :12 NA's :36 NA's :60
## Fwd10yrReturns
## Min. :-0.61661
## 1st Qu.: 0.07159
## Median : 0.49322
## Mean : 0.72440
## 3rd Qu.: 1.29522
## Max. : 3.65442
## NA's :120
Note that the Fwd1yrReturns are the same values as Momentum, but shifted 1 year.
An interesting feature of the forward returns is that they become more positively skewed the longer the duration. The skew is the effect of compounding: gains have no upper bound, but losses are capped at -100% because prices have never fallen below 0.
A logical transformation is to convert all of the returns to their compound annual growth rate (CAGR). This will also make it easier to compare them to each other.
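The conversion from a total return R over n years is CAGR = (1 + R)^(1/n) - 1. A short sketch follows, with `to_cagr` as a hypothetical helper name. As a sanity check, feeding it the minimum 3 year and maximum 10 year total returns from the summary above gives values that agree with the corresponding CAGR summaries.

```r
# Hypothetical helper: annualize a total return earned over `years` years
to_cagr <- function(total_return, years) {
  (1 + total_return)^(1 / years) - 1
}

min_3yr  <- to_cagr(-0.82409, 3)    # about -0.4397
max_10yr <- to_cagr(3.65442, 10)    # about 0.1662
print(round(c(min_3yr, max_10yr), 4))
```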
## [1] "1 year forward returns"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.65609 -0.06067 0.06691 0.06119 0.18493 1.24152 12
## [1] "3 year forward returns"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.43968 -0.01872 0.05139 0.04896 0.11929 0.33611 36
## [1] "5 year forward returns"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.22273 -0.01141 0.04442 0.04690 0.10406 0.27609 60
## [1] "10 year forward returns"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.09142 0.00694 0.04091 0.04517 0.08663 0.16624 120
The oft-cited fat tails of the returns are still present, and a slight negative skew is visible. This is characteristic of the risks investors face: extreme returns occur more often than they would under a normal distribution, and extreme negative returns occur more often than extreme positive ones.
The mean annual returns decrease as the forward-looking periods lengthen. This is a result of the negative skew and its effect on compounding. Large losses require even larger gains for recovery, causing realized CAGRs to be lower than the mean Fwd1yrReturns.
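A two-year example with made-up numbers shows the mechanism. Losing 50% and then gaining 100% averages out to a 25% yearly return, yet the investor ends up exactly where they started, so the realized CAGR is 0.

```r
# Hypothetical two-year path: lose 50%, then gain 100%
yearly <- c(-0.50, 1.00)

mean_return  <- mean(yearly)                              # arithmetic mean: 0.25
total_return <- prod(1 + yearly) - 1                      # compounded result: 0
cagr <- (1 + total_return)^(1 / length(yearly)) - 1       # annualized: 0

print(c(mean = mean_return, total = total_return, cagr = cagr))
```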
After seeing the transformed values I decided to store them for easier access during the rest of the analysis.
The dataset contains 1769 months of data covering 22 features, one of which is the date. The data covers the price, dividends and earnings for the US stock market since 1871. Also included are the Consumer Price Index (CPI) and the 10-Year Treasury rates. The CPI allows conversion of values from nominal to inflation-adjusted levels. The Treasury rates are commonly used as a risk-free benchmark.
There are also many NA values in the data because of the numerous values created from running calculations. CAPE, for example, requires 10 years of earnings values to smooth before it begins showing values.
Shiller’s CAPE ratio is the main feature of the dataset, allowing the analysis of market valuations over the extended time period. I also introduced momentum measures so that I can attempt to build a model for forecasting future returns.
CPI is crucial in transforming the data over time to account for inflation. In addition, I expect to see that inflation itself will have effects on stock returns. I believe this will also be the case for interest rates, which are often used by governments to boost economic growth.
Measures of momentum and inflation were created to show their running 12 month rates of change. Forward returns were also calculated and then transformed into compound annual growth rates (CAGRs) during the analysis.
Categorical values were created to allow partitioning the data into periods of Rising and Falling states of Growth and Inflation. An Outlook categorical variable was also created to test if predicting bullish and bearish periods was more feasible than predicting the specific level of return.
Skewed distributions are the norm in financial data, starting with the long tails in returns and carried through to prices with the effect of compounding over time. Interest rates and inflation also showed long tails.
Log transformations were applied to prices, dividends and earnings to reduce positive skew. They are also a natural transformation to apply because they convert the exponential growth of continuous compounding into an additive one. The CAPE ratio is generated from those variables and benefited from the same transformation.
I also took advantage of the inflation adjustments that were already in the data set to help visualize changes in the variables over time.
Last, I transformed the different forward return measures to annual measures to reduce skew and so that they could be directly compared.
No surprises here: all of the stock features are highly correlated. Applying the log transform, along with transparency to reduce overplotting, shows that they have strong linear relationships and have varied together over time.
There appears to be a negative correlation to forward returns, but the effect is less prominent in the Fwd1yrReturns. It looks as though the CAPE might be less valuable in near term forecasting than for making long term predictions.
The plots also look like they could benefit from using the Log10 transformed version of CAPE.
The cor.test function showed that the Pearson correlations were -0.19, -0.30, -0.37 and -0.40, in order of increasing forward-looking periods.
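For reference, a sketch of how cor.test is typically called and read. The data here is synthetic (the coefficients and noise level are arbitrary assumptions), so the estimate will not match the figures above. Rows with a missing value in either vector are dropped before the test, which matters because the forward returns are NA at the end of the sample.

```r
# Synthetic stand-ins for CAPE and a forward return column (not real data)
set.seed(1)
cape_ratio <- runif(200, min = 5, max = 45)
fwd_return <- 0.20 - 0.10 * log10(cape_ratio) + rnorm(200, sd = 0.05)
fwd_return[196:200] <- NA   # forward returns are missing at the end of the sample

result <- cor.test(cape_ratio, fwd_return)   # incomplete pairs are dropped
print(result$estimate)    # Pearson correlation
print(result$parameter)   # degrees of freedom = complete pairs - 2
```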
The following plots break CAPE into quintiles and show the mean returns for each quintile.
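The binning behind such plots might look like the sketch below, again on synthetic data with assumed coefficients. quantile supplies the cut points, cut assigns each observation to a quintile, and tapply averages the returns within each bin.

```r
# Synthetic stand-ins for CAPE and a forward return column (not real data)
set.seed(42)
cape_ratio <- runif(500, min = 5, max = 45)
fwd_return <- 0.20 - 0.10 * log10(cape_ratio) + rnorm(500, sd = 0.05)

# Cut points at the 0%, 20%, ..., 100% quantiles of CAPE
breaks <- quantile(cape_ratio, probs = seq(0, 1, by = 0.2), na.rm = TRUE)
quintile <- cut(cape_ratio, breaks = breaks, include.lowest = TRUE,
                labels = paste0("Q", 1:5))

# Mean forward return within each CAPE quintile
mean_by_quintile <- tapply(fwd_return, quintile, mean, na.rm = TRUE)
print(round(mean_by_quintile, 3))
```

Swapping mean for median in the tapply call gives the per-quintile medians, which are less sensitive to skewed return distributions.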
Average returns are higher starting from points when the market was selling at a discount. 1 yr returns show the highest average for the lowest CAPE reading. But just looking at the mean does not say anything about the distribution of those returns.
The boxplots show that the means are affected by skewed distributions. The median returns are more of an indication of what investors would earn during a typical year.
There doesn’t appear to be a strong correlation between Momentum and forward returns. It looks like there are outliers. It might help to zoom in.
There still doesn’t appear to be a correlation, which surprises me. Momentum is the tool of choice of trend followers. Maybe the categorical Growth variable will shed some light.
Separating the data into Rising and Falling Growth periods does not appear to provide any benefit at all in predicting higher returns. However, it does tighten the distribution and reduce the tails of the forward 1 year returns, a particular benefit in reducing risk. This is confirmation of something I have learned from studying trend following in the past: its main benefit is not in finding higher returns, but in avoiding the worst periods. However, I was expecting at least some correlation to higher future returns. If momentum were the focus of this study I would consider looking at shorter forward looking periods.
One thing to note is that this connection reverses after the first year, with the 3 and 5 year plots showing the opposite connection and lower average returns after Rising Growth readings.
It looks like there could be correlation over longer time periods. For 1 year the Pearson R is only .05, but it increases to .14, .21 and .39 over the respective longer periods.
##
## Pearson's product-moment correlation
##
## data: cape$GS10 and cape$CAPE
## t = -6.4187, df = 1647, p-value = 1.793e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2029604 -0.1087671
## sample estimates:
## cor
## -0.1562189
##
## Pearson's product-moment correlation
##
## data: cape$GS10 and log10(cape$CAPE)
## t = -9.0448, df = 1647, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2630431 -0.1710574
## sample estimates:
## cor
## -0.2175332
It does look like there could be some negative correlation between GS10 rates and CAPE. But the cor.test function returned -0.16, indicating that it is very weak. The correlation is slightly stronger at -0.22 when using the log10 transformed values of CAPE.
Although it is not readily apparent in the plot, the cor.test function found a weak correlation of 0.36 between CPIChange and Fwd10yrCAGR. It looks like inflation only correlates with returns over long periods.
Inflation does not show correlation to Momentum either.
No correlation was visible between Inflation (as CPIChange) and the GS10 rates. I was expecting to find one because of the Federal Reserve’s use of rate policy to manage the inflation rate. It’s possible that there is a delay between market changes and policy reactions, and that lagging one against the other would reveal correlation.
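I did not run that lagged comparison, but a sketch of how it could be set up is shown here. The helper `lagged_cor` is hypothetical and the data is synthetic: the second series simply repeats the first one three periods later, so the correlation should peak at a lag of 3.

```r
# Hypothetical helper: correlate x at time t with y at time t + lag
lagged_cor <- function(x, y, lag) {
  n <- length(x)
  stopifnot(length(y) == n, lag >= 0, lag < n - 2)
  cor(x[1:(n - lag)], y[(1 + lag):n], use = "complete.obs")
}

# Synthetic example: `rates` repeats `inflation` three periods later
set.seed(7)
inflation <- rnorm(100)
rates <- c(rep(NA, 3), inflation[1:97])

cors <- sapply(0:6, function(k) lagged_cor(inflation, rates, k))
names(cors) <- paste0("lag", 0:6)
print(round(cors, 2))
```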
Forward returns were negatively correlated with CAPE. They did not show correlation to Momentum, Inflation or the GS10 rates. However, their distribution did appear tighter in the 1 year forward returns following rising momentum.
Price, Dividend and Earnings were all highly correlated. This was expected as it was visible in the time series plots that they have varied together over the duration of the dataset.
CAPE and GS10 had a weak negative correlation. This could be the effect of investors being willing to pay a higher price for returns during low interest rate periods, and vice versa.
The strongest relationship was between CAPE and forward returns, with the correlation increasing over the longer periods tested.
It’s difficult to see in this plot, but it does look like there is a stronger trend between CAPE and forward 1 year returns during periods of recession.
Using facet_wrap to separate the conditions makes the difference easier to see. Periods of recession in the stock market exhibit a stronger negative correlation between CAPE and forward returns than periods of expansion.
The feature combination doesn’t appear to have the same benefit over longer time periods.
Inflation and Deflation also appear to complement CAPE’s correlation to returns.
The numeric measures of Inflation and Growth did not show correlation during bivariate analysis, suggesting that they should work well together to enhance the power of CAPE.
CAPE looks strongest in predicting forward 1 year returns during deflationary recession years.
The inverse of CAPE can be interpreted as a yield similar to interest rates. Are investors willing to pay more for earnings when interest rates are low?
There does appear to be a positive correlation between the excess yield of stocks and forward returns.
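A sketch of the excess yield calculation, assuming it is defined as the inverse of CAPE minus the GS10 rate in decimal form. That definition and the readings below are illustrative assumptions, not values taken from the dataset.

```r
# Hypothetical readings: CAPE ratio and 10-year Treasury rate in decimal form
cape_ratio <- c(10, 20, 31)
gs10       <- c(0.06, 0.04, 0.03)

earnings_yield <- 1 / cape_ratio          # inverse of CAPE read as a yield
excess_yield   <- earnings_yield - gs10   # assumed definition of excess yield

print(round(excess_yield, 4))
```

Under this definition a high CAPE can still leave a positive excess yield when interest rates are low enough.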
Using the log10 inverse of CAPE makes the relationship look even stronger. However, the Pearson correlations between the new value and forward returns are 0.18, 0.28, 0.34 and 0.35, slightly lower in magnitude than those for CAPE alone, which indicates that this combination does not provide any benefit.
I will use lm to fit a linear model to predict future returns.
##
## Calls:
## Model 1: lm(formula = Fwd1yrReturns ~ log10(CAPE) + Growth + Inflation +
## GS10, data = cape)
## Model 2: lm(formula = Fwd3yrCAGR ~ log10(CAPE) + Growth + Inflation +
## GS10, data = cape)
## Model 3: lm(formula = Fwd5yrCAGR ~ log10(CAPE) + Growth + Inflation +
## GS10, data = cape)
## Model 4: lm(formula = Fwd10yrCAGR ~ log10(CAPE) + Growth + Inflation +
## GS10, data = cape)
##
## ===========================================================================================
## Model 1 Model 2 Model 3 Model 4
## -------------- ------------- ------------- -------------
## Fwd1yrReturns Fwd3yrCAGR Fwd5yrCAGR Fwd10yrCAGR
## -------------------------------------------------------------------------------------------
## (Intercept) 0.290*** 0.233*** 0.198*** 0.138***
## (0.038) (0.020) (0.015) (0.009)
## log10(CAPE) -0.193*** -0.183*** -0.161*** -0.118***
## (0.028) (0.015) (0.011) (0.006)
## Growth: Recession/Expansion 0.001 -0.008 0.007 -0.006*
## (0.010) (0.005) (0.004) (0.002)
## Inflation: Inflation/Deflation -0.012 0.034*** 0.024*** 0.034***
## (0.011) (0.006) (0.004) (0.003)
## GS10 0.238 0.251* 0.465*** 0.522***
## (0.206) (0.110) (0.082) (0.048)
## -------------------------------------------------------------------------------------------
## R-squared 0.036 0.117 0.182 0.352
## adj. R-squared 0.034 0.115 0.180 0.351
## sigma 0.184 0.097 0.072 0.041
## F 15.355 53.491 88.231 207.330
## p 0.000 0.000 0.000 0.000
## Log-likelihood 452.680 1474.437 1924.220 2699.048
## Deviance 55.129 15.177 8.257 2.622
## AIC -893.360 -2936.874 -3836.440 -5386.096
## BIC -860.957 -2904.558 -3804.215 -5354.101
## N 1637 1613 1589 1529
## ===========================================================================================
The strongest predictions were for 10 year forward returns, but with only 35% of the variance in returns being explained by the model.
A classification model can also be used to predict the Outlook variable. I used rpart to create a tree-based model to predict if the next year would be bullish or bearish.
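A rough sketch of fitting and scoring such a tree on synthetic data follows. The predictors used here (CAPE and Momentum) and the way the labels are generated are assumptions made for illustration only. As in the analysis itself, the tree is scored on the same data it was fitted to.

```r
library(rpart)

# Synthetic data: the chance of a bullish year falls with CAPE (assumed relationship)
set.seed(2018)
n <- 400
toy <- data.frame(CAPE = runif(n, 5, 45), Momentum = rnorm(n, 0.06, 0.2))
p_bull <- plogis(2.5 - 0.1 * toy$CAPE + 2 * toy$Momentum)
toy$Outlook <- factor(ifelse(runif(n) < p_bull, "Bullish", "Bearish"),
                      levels = c("Bearish", "Bullish"))

# Fit a classification tree and predict class labels for the same rows
fit  <- rpart(Outlook ~ CAPE + Momentum, data = toy, method = "class")
pred <- predict(fit, newdata = toy, type = "class")

# Rows are actual labels, columns are predicted labels
confusion <- table(actual = toy$Outlook, pred = pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)
baseline  <- max(table(toy$Outlook)) / n   # accuracy of a majority classifier
print(confusion)
print(round(c(accuracy = accuracy, baseline = baseline), 3))
```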
## pred
## Bearish Bullish
## Bearish 372 272
## Bullish 114 999
The model was 78% accurate in labeling the dataset. One way of measuring the power of a classification model is to compare it to a majority classifier, which would have labelled all of the data as bullish with 63% accuracy.
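Both figures can be recomputed from the confusion matrix above. The sketch assumes the rows are the actual labels and the columns the predictions, which is consistent with the row totals of 644 bearish and 1113 bullish observations.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
confusion <- matrix(c(372, 272,
                      114, 999),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("Bearish", "Bullish"),
                                    pred   = c("Bearish", "Bullish")))

accuracy <- sum(diag(confusion)) / sum(confusion)      # 1371 / 1757
baseline <- max(rowSums(confusion)) / sum(confusion)   # 1113 / 1757, always guessing Bullish
print(round(c(accuracy = accuracy, baseline = baseline), 2))
```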
The Growth and Inflation features both strengthened the correlation between CAPE and 1 year forward returns. CAPE appears to have the strongest correlation to forward returns after periods of recession and deflation. One possible explanation is that periods of expanding growth and inflation push all prices up, without discrimination to the fundamental value of the market. In the reverse situation, a benefit of this relationship is that CAPE can be used to help identify good opportunities after stock and consumer prices have been falling.
Linear models were used to predict future returns. The models showed that returns are easier to predict over longer time horizons, but even the 10 year model was only able to explain 35% of the variation in returns.
A classification model was able to predict the 1 year outlook with 78% accuracy. This showed some strength over the majority classifier.
To settle curiosity, the models are 71% confident that the next year will be bullish, but are only expecting a 1% return over the next 10 years (as of May 2018).
One caveat to keep in mind is that I did not split the data into training and test sets to control for over-fitting. These models were purely for exploring relationships in the data.
This plot shows the log adjusted change in stock prices since 1871. Stocks have been a good investment, but the ride has been anything but smooth.
Market valuations have power in predicting future returns. The average 10 year returns for the stock market are higher and with less variance starting from low CAPE readings. Coincidentally, the CAPE ratio is currently at 31 as of 2018-05-01. Current valuations suggest that the 10 year outlook for stocks is grim, which runs contrary to public opinion.
Periods of expansion have less variance in returns than those of recession. This is a subtle, but important effect. Most investors have a hard time dealing with volatility. They can benefit from an indicator letting them know when to step aside or increase diversification (i.e. tilt portfolio allocations to bonds).
This project gave me the opportunity to explore an interesting dataset covering over 100 years of stock market data. Some wrangling was required to prepare the data and create additional features of interest. I began by exploring all of the variables, creating a few interesting ones along the way. Then I analyzed the interactions between the features and finally created a few models for predicting stock market action.
Several variables were found to be correlated to future returns. These included the CAPE valuation ratio, interest rates and inflation. However, none of the correlations were very strong. This made building an accurate model for predicting future returns difficult, with the strongest model explaining only 35% of the variance of returns. On the other hand, predicting whether the next year was bullish or bearish proved easier. A classification tree model was able to predict this feature with 78% accuracy.
This analysis could be enhanced by introducing additional asset classes to the dataset and building a model that created actionable predictions. For example, a model could be created to predict which asset would have the best performance and used to guide portfolio allocation.