NNT emphasizes the slow speed at which the LLN and CLT work under thick tails. Mean convergence is faster than variance convergence, and so on for higher moments. Finite higher moments speed up lower-moment convergence. If higher moments don't exist, lower-moment estimates never stabilize in finite samples.

The LLN doesn't require finite variance if the rv is i.i.d. (a finite mean suffices). If the rv is independent but not identically distributed, however, variance must be finite for the LLN to work. It also works under weak dependence (covariance approaching 0 as the time between observations increases), but again variance must be finite. The LLN may be visualized as a tightening of the distribution of the sample mean around the true mean, eventually leading to degeneracy (i.e. a Dirac stick at the exact mean).
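
A minimal sketch of this tightening, assuming Uniform(0,1) draws purely for illustration (all sample sizes are arbitrary):

```python
import random

def sample_mean(n, rng):
    """Mean of n i.i.d. Uniform(0, 1) draws (true mean 0.5)."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(42)
# The dispersion of the sample mean shrinks as n grows: the distribution
# tightens around 0.5, heading toward a Dirac stick in the limit.
spread = {}
for n in (10, 1000):
    means = [sample_mean(n, rng) for _ in range(500)]
    spread[n] = max(means) - min(means)
```

Under thick tails the same spread shrinks far more slowly, which is the preasymptotic point above.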

The standard CLT requires the rv to be i.i.d. with finite variance, and only applies to the rv under summation. One can weaken the identically-distributed requirement, but this requires moments that grow at a rate limited by the Lyapunov condition, or at least the weaker Lindeberg condition.

To be insurable, events must be non-subexponential, i.e. the probability of exceeding some threshold must be due to a series of small events rather than a single large event. The Cramér condition must also be met (exponential moments of the rv must exist). Normally distributed events meet these conditions, but thick-tailed events do not. In the former case, exceeding some threshold is more likely to come from a series of events (increasingly so as you move into the tails, due to the exponential decay of tail probabilities), hence the focus on reducing the frequency of events. In the latter case, exceeding some threshold is more likely to come from a single event, so the focus must be on reducing impact.

The Lucretius fallacy is assuming the worst event experienced in the past is the worst event that can happen in the future. Because an empirical distribution is necessarily censored at x_min and x_max, the empirical distribution is not empirical. Beyond the observed max there is a hidden portion of the distribution not shown in past samples, whose moments are unknown (and do not converge via the Glivenko-Cantelli theorem). This is a problem for Kolmogorov-Smirnov tests. It is better to use MLE to estimate the 'excess' or 'shadow' mean (the mean beyond the sample max). Assuming you can estimate the tail exponent, this approach works better for out-of-sample inference than using the sample mean (which is biased under thick tails). The lower the tail exponent and the smaller the sample, the more of the tail is hidden.
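
A toy illustration of the hidden tail, assuming a pure Pareto with tail exponent a = 1.2 (the exponent and sample sizes here are arbitrary choices for the sketch). In the typical (median) sample, the sample mean sits well below the true mean because the shadow mean above the sample max has not been observed:

```python
import random

def pareto_draw(alpha, rng):
    # Inverse-CDF sampling: P(X > x) = x**-alpha for x >= 1.
    return rng.random() ** (-1.0 / alpha)

alpha = 1.2
true_mean = alpha / (alpha - 1)  # = 6 for alpha = 1.2
rng = random.Random(0)

# Median sample mean across 200 samples of n = 1000: downward biased,
# since the contribution of the unobserved tail is missing.
sample_means = sorted(
    sum(pareto_draw(alpha, rng) for _ in range(1000)) / 1000
    for _ in range(200)
)
median_sample_mean = sample_means[100]
```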

For distributions with compact support (a bounded domain including its max and min), all moments are finite, even if the upper bound is very high. Such a distribution behaves very similarly to one with an infinite upper bound until you approach the sample max. Since the upper bound H is so high, samples tend to only produce observations below some threshold M, M<H<inf. To estimate shadow moments (the moments of the portion of the population distribution above M but below H), one can transform the rv with compact support into a rv with an infinite upper bound (such that the infinite max of the transformed variable maps to the finite max of the original variable), use EVT (generalized Pareto) to study the tail behavior above the threshold, and then transform back into a rv with finite upper bound to estimate the shadow moments. If the transformation is one-to-one, the parameters of the PDF obtained via MLE before and after transformation will be the same.

Instability of variances (e.g. in finite samples when the fourth moment is infinite) leads to unstable covariances (meaning individual distributions appear heteroskedastic but the scales don't change in tandem). If a covariance matrix is stochastic or regime-switching (and the correlations are relatively high, not just jumping around 0), then the joint distribution will not be elliptical. In this case, a linear combination of thin-tailed rv's can become thick-tailed. Elliptical distributions have location parameter μ, covariance matrix Σ, and a scalar characteristic generator Ψ (determined by a unique Σ), with characteristic function of the form f(x) = exp(ix'μ)Ψ(x'Σx). Elliptical distributions are closed under linear transformation, so tail events are most likely to come from a series of events rather than one if the component rv's are thin-tailed with finite variance (making room for MPT, since under ellipticality all portfolios are characterized by their location and scale parameters).

Independence is defined as f(x,y)=f(x)f(y). In the class of elliptical distributions, only the multivariate Gaussian with a correlation coefficient of 0 is both uncorrelated and independent. Since non-ellipticality implies a linear combination of rv's is in a different distributional class than each individual rv, f(x,y) can't be in the same distributional class as f(x)f(y). So mutual information is not 0 just because correlation is 0. Mutual information for the Gaussian is -(1/2)log(1-ρ^2), where ρ is the correlation coefficient; this picks up nonlinearities. Note that while covariance can be infinite, correlation is always finite, though it may have huge sampling error and slow convergence under fat tails.
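
The Gaussian mutual-information formula, stated in code (in nats):

```python
import math

def gaussian_mutual_information(rho):
    """Mutual information of a bivariate Gaussian with correlation rho (nats)."""
    return -0.5 * math.log(1 - rho ** 2)

# MI is 0 only at rho = 0 and explodes as |rho| -> 1, capturing the full
# dependence in the Gaussian case (where zero correlation = independence).
mi_weak = gaussian_mutual_information(0.1)
mi_strong = gaussian_mutual_information(0.9)
```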

If data has no structure, all principal components should explain equal amounts of variance (asymptotically). With small samples, PCs will spuriously decline in importance. This becomes much worse with thick-tailed data, i.e. the first few PCs will appear far more important than they really are (meaning dimension reduction doesn't work with thick tails and finite samples).
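
A rough simulation of the effect in 2-D, where structureless data should split variance 50/50 between the two PCs; the crude Student-t generator and all sample sizes are arbitrary choices for the sketch:

```python
import math
import random

def top_eig_share(xs, ys):
    """Share of total variance on the first principal component of 2-D data
    (closed-form eigenvalues of the 2x2 sample covariance matrix)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n
    c = sum((y - my) ** 2 for y in ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    d = math.sqrt((a - c) ** 2 + 4 * b ** 2)
    return ((a + c) / 2 + d / 2) / (a + c)

def student_t3(rng):
    # Student-t with 3 dof: normal over an independent chi scale (thick tails).
    z = rng.gauss(0, 1)
    s = math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(3)) / 3)
    return z / s

rng = random.Random(7)
n, trials = 50, 300
# With no structure, both averages should be 0.5 asymptotically; in small
# samples the first PC spuriously dominates, more so under thick tails.
gauss_avg = sum(
    top_eig_share([rng.gauss(0, 1) for _ in range(n)],
                  [rng.gauss(0, 1) for _ in range(n)])
    for _ in range(trials)) / trials
heavy_avg = sum(
    top_eig_share([student_t3(rng) for _ in range(n)],
                  [student_t3(rng) for _ in range(n)])
    for _ in range(trials)) / trials
```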

For the normal distribution, tail probabilities are convex to the scale of the distribution. This means you can fatten the tails by stochasticizing the conditional variance (i.e. making the process heteroskedastic) while preserving the mean and unconditional variance. As a result, any heavy-tailed process can be described in-sample by a Gaussian distribution with stochastic or regime-switching variance, or by adding non-uniform jumps (Poisson jumps can be considered a mix of Gaussians, since a jump can be modeled as a Gaussian regime with low variance, low probability, and high mean). Think of conditional heteroskedasticity models as just a mix of Gaussians with different variances, which creates thicker tails. But the fourth moment expresses the stability of the second moment: if the fourth moment is unstable or infinite, the variance of the variance is unstable. Since lower moments fail to converge in finite samples when higher moments are infinite, power law processes can disguise themselves as Gaussian processes with time-varying variance. GARCH gives no structure to the variance of the variance, so it may appear to work in-sample but will fail out-of-sample if the process is actually a power law.
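
A quick check that a regime-switching variance fattens tails; the mixture weights and scales (90% sigma = 1, 10% sigma = 5) are arbitrary choices for the sketch:

```python
import random

def excess_kurtosis(xs):
    """Sample excess kurtosis (0 for a Gaussian)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

rng = random.Random(3)
# Mix of two Gaussian regimes with the same mean but different variances:
# same first two unconditional moments machinery, much thicker tails.
mixture = [rng.gauss(0, 1 if rng.random() < 0.9 else 5)
           for _ in range(100_000)]
plain = [rng.gauss(0, 1) for _ in range(100_000)]
```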

The difference between MAD and STD grows as tails get fatter, with STD growing faster since it gives more weight to larger observations. A Gaussian variable with a MAD of 1% has a STD of about 1.25% (i.e. STD=sqrt(pi/2)*MAD). By Jensen's Inequality, MAD<=STD. STD is used more in the literature because it's asymptotically about 12.5% more efficient than MAD for a Gaussian variable. With even small jumps, however, MAD becomes more efficient. Some processes have infinite variance but finite MAD; the reverse is never true, since MAD exists whenever the mean exists. The BSM option pricing model uses variance, but the price maps directly to MAD (an at-the-money straddle reflects conditional mean deviation). So BSM translates MAD to STD and then back to MAD.
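
Verifying the Gaussian STD/MAD ratio of sqrt(pi/2) ≈ 1.2533 by simulation:

```python
import math
import random

rng = random.Random(5)
xs = [rng.gauss(0, 1) for _ in range(200_000)]
n = len(xs)
mean = sum(xs) / n
std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
mad = sum(abs(x - mean) for x in xs) / n
# For a Gaussian, STD = sqrt(pi/2) * MAD.
ratio = std / mad
```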

As the number of dimensions (variables) increases, the sample space gets bigger. Holding the order of the norm (moment) constant, the norm occupies a decreasing portion of that space, while higher-order norms (higher moments) occupy a larger portion. So the ratio of higher moments to lower moments increases with dimensionality, meaning you need much more data to fill out the larger space and learn about higher moments from samples.

A power law distribution has survival function P(X>x) ~ L(x)x^-a, where a is the tail exponent and L(x) is a slowly varying function that converges to a constant C (the Karamata constant), such that L(Kx)/L(x) -> 1 for large enough x. The survival function decays asymptotically with slope -a (a straight line on a log-log plot) in the tails. This means P(X>Kx)/P(X>x) -> some constant, i.e. self-similarity (fractality): the ratio is unchanged as you move deeper into the tail (so a change in x doesn't matter, only K, beyond the Karamata point). This generates power law tails. Before the Karamata point, the distribution may only be slowly varying. If the distribution is power law (regularly varying), then moments of order a and higher will be infinite.
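
The self-similarity property shown for a pure Pareto survival function (the slowly varying part already a constant, so we are past the Karamata point everywhere):

```python
# Pure power law survival: P(X > x) = x**-alpha for x >= 1.
def survival(x, alpha):
    return x ** -alpha

alpha, K = 2.0, 3.0
# P(X > K*x) / P(X > x) = K**-alpha, independent of x: the fractal property.
ratios = [survival(K * x, alpha) / survival(x, alpha)
          for x in (10, 100, 1000)]
```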

When regressing a fat-tailed variable on a thin-tailed variable, the R^2 can be calculated, but it will be stochastic (sample-dependent) and artificially high in-sample. As the expected squared residual (i.e. the variance of the dependent variable) approaches infinity, the true R^2 approaches 0. So the specified equation will perform poorly out-of-sample; and given that R^2 is bounded between 0 and 1, the sample R^2 will only converge with a very large sample. A log transformation can fix things if it is the exact required transformation, but errors arise if not.
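
A sketch of the R^2 instability, regressing an unrelated Student-t (2 dof, infinite variance) variable on a Gaussian one; sample sizes and the t generator are arbitrary choices. The population R^2 is 0, yet individual samples can show deceptively high values:

```python
import math
import random

def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def t2(rng):
    # Student-t with 2 dof: infinite variance (a thick-tailed dependent).
    z = rng.gauss(0, 1)
    s = math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(2)) / 2)
    return z / s

rng = random.Random(11)
# 500 independent regressions of 30 points each, with no true relation.
r2s = [r_squared([rng.gauss(0, 1) for _ in range(30)],
                 [t2(rng) for _ in range(30)])
       for _ in range(500)]
```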

Lack of convergence in distribution in finite samples is exactly why "naïve" 1/n diversification works. MPT only works with fast convergence; hence over-diversification is desired to get closer to convergence. Note that reshuffling the S&P 500 returns gives a finite fourth moment, meaning it's precisely volatility clustering that makes the tails so fat. If kurtosis exists, returns eventually converge to Gaussian as one lengthens the return horizon (day to month, etc.); not so if kurtosis is infinite. Note also that the left tail of market returns is fatter than the right, and single observations account for a large amount of kurtosis: the max quartic term (X^4, non-central kurtosis) accounts for about 80% of sample kurtosis, whereas for a Gaussian the max quartic should account for about 0.008 +/- 0.0028 of it.

Binary bets map to the Heaviside function, while continuous bets map to a more linear function (ReLU on the upside). It's practically impossible to hedge a continuous exposure with binary bets, or vice-versa: you would need an infinite number of binary bets, since you don't know in advance what size to make the binary payoff. The delta of a binary bet (the nondifferentiable Dirac delta function) does not match the delta of a vanilla option, whose payoff is at least once differentiable. Consolidation of beliefs via prices (Hayek's knowledge argument) doesn't lead to prediction markets, as prices are not binary bets.

The body, intermediate area, and tails on distributions can be identified by the points of zero convexity. It follows that the tail portion is convex to errors in the estimation of the scale parameter. As kurtosis increases, intermediate values become less likely, while events inside one STD become more likely (along with the tail). As a result, thick tails lower the value of a binary option, while raising the value of a vanilla option. So it’s improper to accuse one of underestimating probabilities, as it’s rational to not separate probability from payoff, even if probability is miscalibrated. In other words, if a mistake doesn’t cost you anything or even helps you survive, it’s not really a mistake; if it does cost you something but has been around for a long time, there may be a hidden evolutionary advantage.

As the vol of some underlying asset increases (increasingly turning data into noise rather than signal), arbitrage pushes the corresponding binary option value to approach 50 delta and become less variable over the remaining time to expiration (the binary option value can’t vary more than the underlying variability). The higher the uncertainty of the underlying asset, the lower the binary option vol. A continuously made forecast must be a martingale (no drift) to avoid arbitrage, assuming forecasted probabilities are tradable. If a binary option price varies too much, one can make a guaranteed profit through replication (per De Finetti).

NNT highlights how certain researchers conflate the expected payoff in the tail with the payoff at some tail threshold multiplied by the probability in the tail. This applies to CVA, since it uses a static loss given default (which should instead be conditioned on the default event, since the conditional value of collateral is likely far less than the unconditional). This problem is amplified under parameter uncertainty due to Jensen's Inequality. The distribution of a series of binary bets and the Brier score (probability calibration) is normal, while the distribution of real-world payoffs (and the M4 and M5 competition measures) matches the distribution of the underlying variable.

The Gini coefficient is the expected absolute difference between any two data values, scaled by 2 times the mean. It is asymptotically normal if the variance of the data is finite. With infinite variance (as with wealth/income), the limiting distribution of the nonparametric Gini loses normality and symmetry, shifting toward a right-skewed, fatter-tailed limit, resulting in a downward bias in finite samples (since the mode is less than the mean under right skew, increasingly so as the tail exponent decreases). Using maximum likelihood with a parametric [EVT] Gini (exponential family) in the presence of fat tails reduces bias, maintains asymptotic normality, and is asymptotically efficient.
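
The nonparametric Gini as defined above, in code (mean absolute difference over all pairs, scaled by twice the mean):

```python
def gini(xs):
    """Nonparametric Gini: mean absolute difference between all pairs,
    scaled by twice the mean. 0 = perfect equality."""
    n = len(xs)
    mean = sum(xs) / n
    mean_abs_diff = sum(abs(x - y) for x in xs for y in xs) / (n * n)
    return mean_abs_diff / (2 * mean)

# Perfect equality vs. one holder owning everything (max is (n-1)/n).
g_equal = gini([1, 1, 1, 1])
g_concentrated = gini([0, 0, 0, 100])
```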

Sample measures of the top centile's contribution to the total are downward biased and sensitive to sample size (i.e. unstable) under fat tails. As sample size increases year over year, sample measures will converge higher as the bias is reduced, giving the appearance of an increase over time. The bias stems from the response of sample measures to new data: if a new data point falls in the centile/region of interest, the sample estimate is concave to it; if it falls outside the region of interest, the estimate is convex to it. On net, the concavity effect is stronger, constituting an upper bound in finite samples which clips large deviations. The fatter the tail, the stronger the concavity effect.

The weighted average of a concentration measure across subsamples produces a downward bias in the measure for the full sample; e.g. aggregating subsamples into a single sample gives a higher top 1% wealth concentration than averaging the top 1% wealth concentration measures across subsamples. Global centile concentration would thus be higher than the average across all countries. So concentration estimates are superadditive. This also partially explains the apparent rise in concentration over time (as the size of the unit measured increases).
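
A simulation sketch of the superadditivity claim, assuming Pareto-distributed wealth (the tail exponent, subsample sizes, and trial count are all arbitrary choices for the sketch). The pooled top-1% share should on average exceed the average of the subsample shares:

```python
import random

def top_share(xs, q=0.01):
    """Share of the total held by the top q fraction of observations."""
    xs = sorted(xs, reverse=True)
    k = max(1, int(len(xs) * q))
    return sum(xs[:k]) / sum(xs)

def pareto(alpha, rng):
    # Inverse-CDF sampling: P(X > x) = x**-alpha for x >= 1.
    return rng.random() ** (-1.0 / alpha)

rng = random.Random(9)
alpha, m, trials = 1.25, 1000, 300
# Average gap between the pooled measure and the averaged subsample measures.
gap = 0.0
for _ in range(trials):
    a = [pareto(alpha, rng) for _ in range(m)]
    b = [pareto(alpha, rng) for _ in range(m)]
    pooled = top_share(a + b)
    averaged = (top_share(a) + top_share(b)) / 2
    gap += (pooled - averaged) / trials
```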

Uncertainty about uncertainty can lead to fat tails (epistemic). Integrating the PDF over values of the scale parameter (weighted by a PDF for the scale) gives the error rate for the scale (i.e. integrating once gives the second-order STD, or STD of STD). Integrating N times gives the Nth-order error rate, the error rate of the error rate, etc. for N recursions. If the error rate is a positive constant or increasing from recursion to recursion, the second and all higher moments explode and become infinite as N->inf (though the first moment does not explode). In this case one can't use distributions in the L^2 norm, as the distribution approaches a power law. If the error rate decays, moments (and the L^2 norm) are fine.

p-values for a t-test have a right-skewed distribution, making most realizations of the p-value (from random samples) fall significantly below the true p-value (i.e. the expected p-value via the LLN across the ensemble of possible samples), erroneously signaling significance. The right skew raises the average and makes measures of dispersion in L^1 and L^2 (and higher norms) vary significantly with different true p-values.
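
A sketch of that right skew, substituting a z-test for the t-test (known unit variance, an assumption made to keep the example dependency-free); the effect size, sample size, and trial count are arbitrary:

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(sample):
    """Two-sided z-test of mean 0, assuming known unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - phi(abs(z)))

rng = random.Random(13)
n, trials = 50, 2000
# With a true effect size of 0.3, the realized p-value distribution is
# right-skewed: the median realization sits well below the mean (the
# "true" p-value), so most samples overstate significance.
ps = sorted(p_value([rng.gauss(0.3, 1) for _ in range(n)])
            for _ in range(trials))
median_p = ps[trials // 2]
mean_p = sum(ps) / trials
```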

The precautionary principle is necessary for higher-order units (the ecosystem, humanity, etc.) that do not "renew" the way lower-order units do (individual people, animals, goods, etc.). With repeated exposure to a low-probability event, its probability of occurrence approaches 1 over time. If one's exposure f(x) has an absorbing barrier, one must focus on time probability (path dependence) rather than ensemble probability (path independence). Since financial asset prices, particularly equities, are non-ergodic (time average ≠ ensemble average due to fat tails), one is not guaranteed the return of the market unconditionally. Hence the myopic loss aversion explanation (increased sensitivity to losses and less willingness to accept risk the more often you check performance) of the equity premium puzzle falls apart. Risks accumulate for individuals, making it rational to be loss averse and avoid tail risks.

Hyperbolic discounting as an explanation for time inconsistency of preferences is often considered irrational. However, if a future event is uncertain, one may prefer immediacy; but conditional on being around at a future date, reverse this preference (making hyperbolic discounting rational). If the discount rate is stochastic (e.g. gamma distributed), then intertemporal preferences flatten in the future no matter how steep they are at present, which explains the perceived drop in the discount rate.

The BSM model is really an argument for removing the expectation (risk-based drift) of the underlying security from the option pricing formula, since an option can be turned into a risk-free instrument via dynamic hedging. The risk premium is then reflected in the stock price. Bachelier's model is based on the actuarial expectation of the final payoff, not continuous-time dynamic hedging (as with BSM). With dynamic hedging, higher-order terms explaining changes in portfolio value disappear rapidly since the underlying is assumed Gaussian (i.e. all moments converge), collapsing the option into a deterministic payoff at the limit of delta-t -> 0 (implying continuous hedging). It only applies at the limit because the hedge ratio is computed at time t, but there is a non-anticipating difference between the price at time t and the resulting price one time increment later. If the underlying is power law distributed, or has infinite moments, higher-order terms become significant (especially for strikes away from the money). In this case dynamic hedging does not remove risk, no matter the hedge frequency, and the option payoff remains stochastic.

Aggregation of market returns (from daily to weekly, etc.) does not achieve normality via CLT in real time. The preasymptotics of fractal distributions are such that the CLT is too slow. Further, with longer return periods there is less data, meaning fewer tail episodes, giving the illusion of thinner tails in-sample. Also, aggregation is at odds with the dynamic hedging approach to option pricing, which requires high frequency data. If a distribution is fractal at the time of a dynamic hedge, higher moments are infinite, making the hedge ineffective (a Taylor expansion is not possible as higher order moments are explosive).

Use of a vol surface essentially treats options as heteroskedastic (i.e. mix of Gaussians); after all, a scalable distribution with infinite variance can be expressed as a mix of Gaussians. But the vol surface is inconsistent with the BSM assumptions (constant variance). If options can be hedged with other options, pricing will be supply/demand based, contrasting with the assumptions of BSM (which would imply dealers simply manufacture new dynamically hedged portfolios as a perfect substitute for options in response to demand; so option traders do not estimate the odds of rare events by pricing out-of-the-money options).
