Probability Distribution - Continuous¶


  • Author: Diego Inácio
  • GitHub: github.com/diegoinacio
  • Notebook: distributions_continuous.ipynb

Brief overview of continuous probability distributions.

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
In [2]:
np.seterr(divide="ignore", invalid="ignore")
plt.rcParams["figure.figsize"] = (16, 8)

Introduction¶


A continuous probability distribution is the probability distribution of real number variables with uncountably many possible values. It means that a random variable $X$ has a continuous distribution probability if there is a funtion $f:\mathbb{R} \rightarrow [0,\infty]$ for each interval $[a,b]\subset\mathbb{R}$.

Probability density function¶


The probability density funtion (PDF) is a function whose value represents a relative likelihood that the value of the random variable would be close to that sample. To calculate $X$ belonging to the interval $[a,b]$:

$$ \large f(x) = Pr(a \leq X \leq b)=\int_a^b g(x)dx $$

Cumulative distribution function¶


A cumulative distribution function (CDF) calculates the probability of a random observation being less than or equal a certain value. It can be expressed as:

$$ \large F(x)=Pr(X \leq x)=\int_{-\infty}^x f(t) dt $$

Normal distribution¶


The normal distribution is one of the most used to model natural phenomena in general. It has as parameters $\mu$ (mean or expectation) and $\sigma$ (standard deviation) or $\sigma^2$ (variation), and its probability density function can be expressed as:

$$ \large f(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}=\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$
In [3]:
def normalDistribution(x, mu, sigma):
    norm = 1/(sigma*(2*np.pi)**0.5)
    return norm*np.exp(-1/2*((x - mu)/(sigma))**2)
In [4]:
x = np.linspace(-10, 10, 1000)

# Distribution A
mu, sigma = 0, 1

plt.plot(
    x, normalDistribution(x, mu, sigma), 
    label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)

# Distribution B
mu, sigma = 5, 1

plt.plot(
    x, normalDistribution(x, mu, sigma), 
    label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)

# Distribution C
mu, sigma = 0, 2

plt.plot(
    x, normalDistribution(x, mu, sigma), 
    label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.legend()
plt.show()

The cumulative distribution function of the gaussian distribution (also denoted with $\Phi$) can be expressed as:

$$ \large \Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-\frac{t^2}{2}}dx $$$$ \large F(x;\mu,\sigma^2)=\Phi\left(\frac{x-\mu}{\sigma}\right) $$
In [5]:
def Phi(x):
    t = x
    dt = t[1] - t[0]
    ft = np.exp(-t**2/2)
    norm = 1/((2*np.pi)**0.5)
    RANGE = np.arange(x.size) + 1
    integ = np.array([ft[:x].sum() for x in RANGE])
    return norm*integ*dt

def cumulativeNormalDistribution(x, mu, sigma):
    Phi_ = Phi((x - mu)/(sigma*2**0.5))
    return 1/2*(1 + Phi_)
In [6]:
x = np.linspace(-10, 10, 1000)

# Distribution A
mu, sigma = 0, 1

plt.plot(
    x, cumulativeNormalDistribution(x, mu, sigma), 
    label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)

# Distribution B
mu, sigma = 5, 1

plt.plot(
    x, cumulativeNormalDistribution(x, mu, sigma), 
    label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)

# Distribution C
mu, sigma = 0, 2

plt.plot(
    x, cumulativeNormalDistribution(x, mu, sigma), 
    label=f'$\mu={mu} \\quad|\\quad \sigma={sigma}$'
)

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')
plt.legend()
plt.show()

Chi-squared distribution¶


A chi-squared distribution (or $\chi^2$-distribution) with $k$ degrees of freedom represents the sum of squares of $k$ normally distributed random variables, and its probability density function can be expressed as:

$$ \large f(x;k)= \begin{cases} \frac{1}{2^{\frac{k}{2}}\Gamma\left(\frac{k}{2}\right)} x^{\frac{k}{2} - 1} e^{-\frac{x}{2}} &, x>0; \\ 0 &, \text{otherwise.} \end{cases} $$

where $\Gamma$ is the Gamma function expressed as:

$$ \large \Gamma(s)=\int_{0}^{\infty}t^{s-1}e^{-t}dt $$
In [7]:
def Gamma(s, infty=100, dt=1e-4):
    t = np.arange(0, infty, dt)
    s_1 = s - 1 if s >= 1 else s
    ft = t**s_1*np.exp(-t)
    return ft.sum()*dt

def chiSquaredDistribution(x, k):
    norm = 1/(2**(k/2)*Gamma(k/2))
    x_ = x**(k/2 - 1)
    e_ = np.exp(-x/2)
    return np.where(x >= 0, norm*x_*e_, 0)
In [8]:
x = np.linspace(0, 20, 1000)

# Distributions at k
for k in [2, 3, 5, 9, 15]:
    plt.plot(
        x, chiSquaredDistribution(x, k), 
        label=f'$k={k}$'
    )

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.ylim(0, 0.5)
plt.legend()
plt.show()

The cumulative distribution function of the chi-squared distribution can be expressed as:

$$ \large F(x;k)=\frac{1}{\Gamma\left(\frac{k}{2}\right)} \gamma\left(\frac{x}{2},\frac{k}{2}\right) $$

where $\gamma$ is the lower incomplete gamma function expressed as:

$$ \large \gamma(x;s)=\int_{0}^{x}t^{s-1}e^{-t}dt $$
In [9]:
def li_gamma(x, s):
    t = x
    dt = t[1] - t[0]
    ft = t**(s - 1)*np.exp(-t)
    RANGE = np.arange(x.size) + 1
    integ = np.array([ft[:x].sum() for x in RANGE])
    return integ*dt

def cumulativeChiSquaredDistribution(x, k):
    norm = 1/Gamma(k/2)
    return norm*li_gamma(x/2, k/2)
In [10]:
x = np.linspace(0, 20, 1000)

# Distributions at k
for k in [2, 3, 5, 9, 15]:
    plt.plot(
        x, cumulativeChiSquaredDistribution(x, k), 
        label=f'$k={k}$'
    )

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')

plt.legend()
plt.show()

Student's t-distribution¶


The student's t-distribution is commonly used when the sample size is small and the populations's standard deviation is unknown, and its probability density function can be expressed as:

$$ \large f(x;v)=\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\Gamma\left(\frac{v}{2}\right)}\left(1+\frac{x^2}{v}\right)^{-\frac{v+1}{2}} $$

where:

  • $\Gamma$ is the gamma function;
  • parameter $v$ is the number of degrees of freedom.
In [11]:
def studentsTDistribution(x, v):
    norm = Gamma((v + 1)/2)/((v*np.pi)**0.5*Gamma(v/2))
    return norm*(1 + x**2/v)**(-(v + 1)/2)
In [12]:
x = np.linspace(-5, 5, 1000)

# Distributions at v
for v in [2, 5, 20, 100]:
    plt.plot(
        x, studentsTDistribution(x, v), 
        label=f'$v={v}$'
    )

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')

plt.legend()
plt.show()

The cumulative distribution function of the students's t-distribution can be expressed as:

$$ \large F(x;v)=\frac{1}{2} + x\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\Gamma\left(\frac{v}{2}\right)}{}_2F_1\left(\frac{1}{2},\frac{v+1}{2};\frac{3}{2};-\frac{x^2}{v}\right) $$

where ${}_2F_1$ is a particular case of the hypergeometric function defined for $|z| < 1$ by the power series:

$$ \large {}_2F_1(a,b;c;z)=\sum_{n=0}^{\infty}\frac{a^{(n)}b^{(n)}}{c^{(n)}}\frac{z^n}{n!} $$

Here $x^{(n)}$ is the rising factorial defined as:

$$ \large x^{(n)}=x(x+1)(x+2)...(x+n-1)=\prod_{k=0}^{n-1}(x+k) $$
In [13]:
def factorial(n):
    if n <= 1:
        return 1
    return np.prod(np.arange(n) + 1, dtype=np.float64)

def risingFactorial(x, n):
    RANGE = np.arange(n)
    return np.prod([x + k for k in RANGE], dtype=np.float64)

def hypergeometric(a, b, c, z, infty=50):
    f = lambda n: factorial(n)
    a_ = lambda n: risingFactorial(a, n)
    b_ = lambda n: risingFactorial(b, n)
    c_ = lambda n: risingFactorial(c, n)
    RANGE = np.arange(infty)
    ABCZN = [a_(n)*b_(n)/c_(n)*z**n/f(n) for n in RANGE]
    _2F1 = np.sum(ABCZN, axis=0)
    # Given the definition is only for |z| < 1
    # and to deal with the analytic continuation problem
    # lets use the function from scipy for a while.
    import scipy.special as sc
    return sc.hyp2f1(a, b, c, z)

def cumulativeStudentsTDistribution(x, v):
    norm = Gamma((v + 1)/2)/((v*np.pi)**0.5*Gamma(v/2))
    _2F1 = hypergeometric(1/2, (v + 1)/2, 3/2, -x**2/v)
    return 1/2 + x*norm*_2F1
In [14]:
x = np.linspace(-5, 5, 1000, dtype=np.float64)

# Distributions at v
for v in [2, 5, 20, 100]:
    plt.plot(
        x, cumulativeStudentsTDistribution(x, v), 
        label=f'$v={v}$'
    )

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')

plt.legend()
plt.show()

F-distribution¶


A f-distribution arises frequently as the null distribution of a test statistic, mostly in the analysis of variance (ANOVA) and other F-tests. Its probability density function for the degrees of freedom $d_1$ and $d_2$ can be expressed as:

$$ \large f(x;d_1,d_2)=\frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x+d_2)^{d_1+d_2}}}}{xB\left(\frac{d_1}{2},\frac{d_2}{2}\right)} $$

where $B$ is the beta function expressed as:

$$ \large B(z_1,z_2)=\int_0^1 t^{z_1-1}(1-t)^{z_2-1}dt $$
In [15]:
def Beta(z1, z2, dt=1e-4):
    z1, z2 = max(z1, 0), max(z2, 0)
    t = np.linspace(0, 1, int(1/dt) + 1)
    ft = t**(z1 - 1)*(1 - t)**(z2 - 1)
    ft[~np.isfinite(ft)] = 0
    return ft.sum()*dt

def fDistribution(x, d1, d2):
    num1 = (d1*x)**d1*d2**d2
    den1 = (d1*x + d2)**(d1 + d2)
    num2 = (num1/den1)**0.5
    den2 = x*Beta(d1/2, d2/2)
    return num2/den2
In [16]:
x = np.linspace(0, 3, 500)

# Distributions at d1 and d2
for [d1, d2] in [(1, 1), (2, 1), (5, 2), (10, 3), (100, 5)]:
    plt.plot(
        x, fDistribution(x, d1, d2), 
        label=f'$d_1={d1} \quad | \quad d_2={d2}$'
    )

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.ylim([0, 2])

plt.legend()
plt.show()

The cumulative distribution function of the f-distribution can be expressed as:

$$ \large F(x;d_1,d_2)=I_{\frac{d_1x}{d_1x+d_2}}\left(\frac{d_1}{2},\frac{d_2}{2}\right) $$

where $I_x$ is the regularized incomplete beta function expressed as:

$$ \large I_x(a,b)=\frac{B(x;a,b)}{B(a,b)} $$

which is defined in terms of the incomplete beta function and the complete beta function (or simply beta function). The incomplete beta function is defined as:

$$ \large B(x;a,b)=\int_0^{x} t^{a-1}(1-t)^{b-1}dt $$
In [17]:
def i_beta(x, a, b):
    t = x
    dt = t[1] - t[0]
    ft = t**(a - 1)*(1 - t)**(b - 1)
    ft[~np.isfinite(ft)] = 0
    RANGE = np.arange(x.size) + 1
    integ = np.array([ft[:x].sum() for x in RANGE])
    return integ*dt

def ri_beta(x, a, b):
    Ix = i_beta(x, a, b)/Beta(a, b)
    # For any reason, the incomplete beta ..
    # .. function is not working as expected.
    # Check back in the future.
    # Lets use the function from scipy for a while.
    import scipy.special as sc
    return sc.betainc(a, b, x)

def cumulativeFDistribution(x, d1, d2):
    d1d2x = d1*x/(d1*x + d2)
    return ri_beta(d1d2x, d1/2, d2/2)
In [18]:
x = np.linspace(0, 3, 500)

# Distributions at d1 and d2
for [d1, d2] in [(1, 1), (2, 1), (5, 2), (10, 3), (100, 5)]:
    plt.plot(
        x, cumulativeFDistribution(x, d1, d2), 
        label=f'$d_1={d1} \quad | \quad d_2={d2}$'
    )

# Visualization
plt.xlabel(r'$x$')
plt.ylabel(r'$F(x)$')
plt.ylim([0, 1])

plt.legend()
plt.show()