Brief overview of continuous probability distributions.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.seterr(divide="ignore", invalid="ignore")
plt.rcParams["figure.figsize"] = (16, 8)
A continuous probability distribution is the probability distribution of real number variables with uncountably many possible values. It means that a random variable $X$ has a continuous distribution probability if there is a funtion $f:\mathbb{R} \rightarrow [0,\infty]$ for each interval $[a,b]\subset\mathbb{R}$.
The probability density funtion (PDF) is a function whose value represents a relative likelihood that the value of the random variable would be close to that sample. To calculate $X$ belonging to the interval $[a,b]$:
$$ \large f(x) = Pr(a \leq X \leq b)=\int_a^b g(x)dx $$A cumulative distribution function (CDF) calculates the probability of a random observation being less than or equal a certain value. It can be expressed as:
$$ \large F(x)=Pr(X \leq x)=\int_{-\infty}^x f(t) dt $$The normal distribution is one of the most used to model natural phenomena in general. It has as parameters $\mu$ (mean or expectation) and $\sigma$ (standard deviation) or $\sigma^2$ (variation), and its probability density function can be expressed as:
$$ \large f(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}=\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$def normalDistribution(x, mu, sigma):
norm = 1/(sigma*(2*np.pi)**0.5)
return norm*np.exp(-1/2*((x - mu)/(sigma))**2)
x = np.linspace(-10, 10, 1000)
# Distribution A
mu, sigma = 0, 1
plt.plot(
x, normalDistribution(x, mu, sigma),
label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)
# Distribution B
mu, sigma = 5, 1
plt.plot(
x, normalDistribution(x, mu, sigma),
label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)
# Distribution C
mu, sigma = 0, 2
plt.plot(
x, normalDistribution(x, mu, sigma),
label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.legend()
plt.show()
The cumulative distribution function of the gaussian distribution (also denoted with $\Phi$) can be expressed as:
$$ \large \Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-\frac{t^2}{2}}dx $$$$ \large F(x;\mu,\sigma^2)=\Phi\left(\frac{x-\mu}{\sigma}\right) $$def Phi(x):
t = x
dt = t[1] - t[0]
ft = np.exp(-t**2/2)
norm = 1/((2*np.pi)**0.5)
RANGE = np.arange(x.size) + 1
integ = np.array([ft[:x].sum() for x in RANGE])
return norm*integ*dt
def cumulativeNormalDistribution(x, mu, sigma):
Phi_ = Phi((x - mu)/(sigma*2**0.5))
return 1/2*(1 + Phi_)
x = np.linspace(-10, 10, 1000)
# Distribution A
mu, sigma = 0, 1
plt.plot(
x, cumulativeNormalDistribution(x, mu, sigma),
label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)
# Distribution B
mu, sigma = 5, 1
plt.plot(
x, cumulativeNormalDistribution(x, mu, sigma),
label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)
# Distribution C
mu, sigma = 0, 2
plt.plot(
x, cumulativeNormalDistribution(x, mu, sigma),
label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')
plt.legend()
plt.show()
A chi-squared distribution (or $\chi^2$-distribution) with $k$ degrees of freedom represents the sum of squares of $k$ normally distributed random variables, and its probability density function can be expressed as:
$$ \large f(x;k)= \begin{cases} \frac{1}{2^{\frac{k}{2}}\Gamma\left(\frac{k}{2}\right)} x^{\frac{k}{2} - 1} e^{-\frac{x}{2}} &, x>0; \\ 0 &, \text{otherwise.} \end{cases} $$where $\Gamma$ is the Gamma function expressed as:
$$ \large \Gamma(s)=\int_{0}^{\infty}t^{s-1}e^{-t}dt $$def Gamma(s, infty=100, dt=1e-4):
t = np.arange(0, infty, dt)
s_1 = s - 1 if s >= 1 else s
ft = t**s_1*np.exp(-t)
return ft.sum()*dt
def chiSquaredDistribution(x, k):
norm = 1/(2**(k/2)*Gamma(k/2))
x_ = x**(k/2 - 1)
e_ = np.exp(-x/2)
return np.where(x >= 0, norm*x_*e_, 0)
x = np.linspace(0, 20, 1000)
# Distributions at k
for k in [2, 3, 5, 9, 15]:
plt.plot(
x, chiSquaredDistribution(x, k),
label=f'$k={k}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.ylim(0, 0.5)
plt.legend()
plt.show()
The cumulative distribution function of the chi-squared distribution can be expressed as:
$$ \large F(x;k)=\frac{1}{\Gamma\left(\frac{k}{2}\right)} \gamma\left(\frac{x}{2},\frac{k}{2}\right) $$where $\gamma$ is the lower incomplete gamma function expressed as:
$$ \large \gamma(x;s)=\int_{0}^{x}t^{s-1}e^{-t}dt $$def li_gamma(x, s):
t = x
dt = t[1] - t[0]
ft = t**(s - 1)*np.exp(-t)
RANGE = np.arange(x.size) + 1
integ = np.array([ft[:x].sum() for x in RANGE])
return integ*dt
def cumulativeChiSquaredDistribution(x, k):
norm = 1/Gamma(k/2)
return norm*li_gamma(x/2, k/2)
x = np.linspace(0, 20, 1000)
# Distributions at k
for k in [2, 3, 5, 9, 15]:
plt.plot(
x, cumulativeChiSquaredDistribution(x, k),
label=f'$k={k}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')
plt.legend()
plt.show()
The student's t-distribution is commonly used when the sample size is small and the populations's standard deviation is unknown, and its probability density function can be expressed as:
$$ \large f(x;v)=\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\Gamma\left(\frac{v}{2}\right)}\left(1+\frac{x^2}{v}\right)^{-\frac{v+1}{2}} $$where:
def studentsTDistribution(x, v):
norm = Gamma((v + 1)/2)/((v*np.pi)**0.5*Gamma(v/2))
return norm*(1 + x**2/v)**(-(v + 1)/2)
x = np.linspace(-5, 5, 1000)
# Distributions at v
for v in [2, 5, 20, 100]:
plt.plot(
x, studentsTDistribution(x, v),
label=f'$v={v}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.legend()
plt.show()
The cumulative distribution function of the students's t-distribution can be expressed as:
$$ \large F(x;v)=\frac{1}{2} + x\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\Gamma\left(\frac{v}{2}\right)}{}_2F_1\left(\frac{1}{2},\frac{v+1}{2};\frac{3}{2};-\frac{x^2}{v}\right) $$where ${}_2F_1$ is a particular case of the hypergeometric function defined for $|z| < 1$ by the power series:
$$ \large {}_2F_1(a,b;c;z)=\sum_{n=0}^{\infty}\frac{a^{(n)}b^{(n)}}{c^{(n)}}\frac{z^n}{n!} $$Here $x^{(n)}$ is the rising factorial defined as:
$$ \large x^{(n)}=x(x+1)(x+2)...(x+n-1)=\prod_{k=0}^{n-1}(x+k) $$def factorial(n):
if n <= 1:
return 1
return np.prod(np.arange(n) + 1, dtype=np.float64)
def risingFactorial(x, n):
RANGE = np.arange(n)
return np.prod([x + k for k in RANGE], dtype=np.float64)
def hypergeometric(a, b, c, z, infty=50):
f = lambda n: factorial(n)
a_ = lambda n: risingFactorial(a, n)
b_ = lambda n: risingFactorial(b, n)
c_ = lambda n: risingFactorial(c, n)
RANGE = np.arange(infty)
ABCZN = [a_(n)*b_(n)/c_(n)*z**n/f(n) for n in RANGE]
_2F1 = np.sum(ABCZN, axis=0)
# Given the definition is only for |z| < 1
# and to deal with the analytic continuation problem
# lets use the function from scipy for a while.
import scipy.special as sc
return sc.hyp2f1(a, b, c, z)
def cumulativeStudentsTDistribution(x, v):
norm = Gamma((v + 1)/2)/((v*np.pi)**0.5*Gamma(v/2))
_2F1 = hypergeometric(1/2, (v + 1)/2, 3/2, -x**2/v)
return 1/2 + x*norm*_2F1
x = np.linspace(-5, 5, 1000, dtype=np.float64)
# Distributions at v
for v in [2, 5, 20, 100]:
plt.plot(
x, cumulativeStudentsTDistribution(x, v),
label=f'$v={v}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')
plt.legend()
plt.show()
A f-distribution arises frequently as the null distribution of a test statistic, mostly in the analysis of variance (ANOVA) and other F-tests. Its probability density function for the degrees of freedom $d_1$ and $d_2$ can be expressed as:
$$ \large f(x;d_1,d_2)=\frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x+d_2)^{d_1+d_2}}}}{xB\left(\frac{d_1}{2},\frac{d_2}{2}\right)} $$where $B$ is the beta function expressed as:
$$ \large B(z_1,z_2)=\int_0^1 t^{z_1-1}(1-t)^{z_2-1}dt $$def Beta(z1, z2, dt=1e-4):
z1, z2 = max(z1, 0), max(z2, 0)
t = np.linspace(0, 1, int(1/dt) + 1)
ft = t**(z1 - 1)*(1 - t)**(z2 - 1)
ft[~np.isfinite(ft)] = 0
return ft.sum()*dt
def fDistribution(x, d1, d2):
num1 = (d1*x)**d1*d2**d2
den1 = (d1*x + d2)**(d1 + d2)
num2 = (num1/den1)**0.5
den2 = x*Beta(d1/2, d2/2)
return num2/den2
x = np.linspace(0, 3, 500)
# Distributions at d1 and d2
for [d1, d2] in [(1, 1), (2, 1), (5, 2), (10, 3), (100, 5)]:
plt.plot(
x, fDistribution(x, d1, d2),
label=f'$d_1={d1} \quad | \quad d_2={d2}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.ylim([0, 2])
plt.legend()
plt.show()
The cumulative distribution function of the f-distribution can be expressed as:
$$ \large F(x;d_1,d_2)=I_{\frac{d_1x}{d_1x+d_2}}\left(\frac{d_1}{2},\frac{d_2}{2}\right) $$where $I_x$ is the regularized incomplete beta function expressed as:
$$ \large I_x(a,b)=\frac{B(x;a,b)}{B(a,b)} $$which is defined in terms of the incomplete beta function and the complete beta function (or simply beta function). The incomplete beta function is defined as:
$$ \large B(x;a,b)=\int_0^{x} t^{a-1}(1-t)^{b-1}dt $$def i_beta(x, a, b):
t = x
dt = t[1] - t[0]
ft = t**(a - 1)*(1 - t)**(b - 1)
ft[~np.isfinite(ft)] = 0
RANGE = np.arange(x.size) + 1
integ = np.array([ft[:x].sum() for x in RANGE])
return integ*dt
def ri_beta(x, a, b):
Ix = i_beta(x, a, b)/Beta(a, b)
# For any reason, the incomplete beta ..
# .. function is not working as expected.
# Check back in the future.
# Lets use the function from scipy for a while.
import scipy.special as sc
return sc.betainc(a, b, x)
def cumulativeFDistribution(x, d1, d2):
d1d2x = d1*x/(d1*x + d2)
return ri_beta(d1d2x, d1/2, d2/2)
x = np.linspace(0, 3, 500)
# Distributions at d1 and d2
for [d1, d2] in [(1, 1), (2, 1), (5, 2), (10, 3), (100, 5)]:
plt.plot(
x, cumulativeFDistribution(x, d1, d2),
label=f'$d_1={d1} \quad | \quad d_2={d2}$'
)
# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')
plt.ylim([0, 1])
plt.legend()
plt.show()