Overview of measures of position in samples or distributions.
In descriptive statistics, measures of position are a way to know where a certain data point or range falls in a sample or population distribution. Most of them are used to summarize the data descriptively, and many are robust to the influence of extreme observations, known as outliers.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Quantiles are cut points that divide the range of the data into contiguous intervals with equal probabilities. A set of q-quantiles splits the data into q such intervals, using q - 1 cut points as the boundaries between consecutive subsets.
Some q-quantiles have special names, including:
Median: the 2-quantile. It is a very well-known measure of central location and sits exactly in the middle of the ordered data, dividing the bottom 50% from the top 50%;
Quartile: the 4-quantiles. There are 3 cut points [$Q_1$, $Q_2$, $Q_3$] that divide the ordered data into 4 equal parts;
Decile: the 10-quantiles. There are 9 cut points [$D_1$, $D_2$, ..., $D_9$] that divide the ordered data into 10 equal parts;
Percentile: the 100-quantiles. There are 99 cut points [$P_1$, $P_2$, ..., $P_{99}$], and each part contains 1% of the ordered data.
For example, given the list of 5 numbers: 5, 87, 45, 32, 1
The first step is to sort all the elements: 1, 5, 32, 45, 87
Finally, we can get the quartile points, which are:
numbers = pd.Series([5, 87, 45, 32, 1])
Q1 = numbers.quantile(0.25)
Q3 = numbers.quantile(0.75)
Q2 = numbers.quantile(0.50) # same as median
print(f'Q1 | First quartile is: {Q1}')
print(f'Q2 | Second quartile is: {Q2} (same as median)')
print(f'Q3 | Third quartile is: {Q3}')
Q1 | First quartile is: 5.0
Q2 | Second quartile is: 32.0 (same as median)
Q3 | Third quartile is: 45.0
Notice that we can select any quantile by giving its relative position as a fraction between 0 and 1.
# Define range from 0-100
numbers = pd.Series(np.arange(101))
Q1 = numbers.quantile(0.25)
D3 = numbers.quantile(0.30)
P2 = numbers.quantile(0.02)
print(f'Q1 | First quartile is: {Q1}')
print(f'D3 | Third decile is: {D3}')
print(f'P2 | Second percentile is: {P2}')
Q1 | First quartile is: 25.0
D3 | Third decile is: 30.0
P2 | Second percentile is: 2.0
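The named quantiles all follow one rule: a set of q-quantiles has q - 1 cut points. The sketch below illustrates this with the standard library's `statistics.quantiles`, which is not used elsewhere in this notebook. The choice of `method="inclusive"` is an assumption made here so that the results agree with the linear interpolation that pandas' `quantile` applies by default.

```python
import statistics

data = list(range(101))  # the same 0..100 range used above

# n is the number of equal-probability intervals; the result holds n - 1 cut points
quartiles = statistics.quantiles(data, n=4, method="inclusive")
deciles = statistics.quantiles(data, n=10, method="inclusive")
percentiles = statistics.quantiles(data, n=100, method="inclusive")

print(len(quartiles), len(deciles), len(percentiles))  # 3 9 99
print(quartiles)                                       # [25.0, 50.0, 75.0]
print(deciles[2], percentiles[1])                      # 30.0 2.0 (D3 and P2)
```

The list indices are zero-based, so `deciles[2]` is $D_3$ and `percentiles[1]` is $P_2$.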
In the same way as in measures of central tendency, the number of elements in the set matters here. Take the median as an example: if the data has an odd number of elements, it is the middle element (the $\frac{n+1}{2}$th element). If the data has an even number of elements, it is the mean of the two central elements (the $\frac{n}{2}$th and the $\left[ \frac{n}{2} + 1 \right]$th).
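The odd and even cases for the median can be sketched in plain Python. The helper name `median_by_position` is hypothetical and exists only for illustration; the standard library's `statistics.median` is used as a cross-check.

```python
import statistics

def median_by_position(values):
    """Median picked by position in the sorted data (zero-based indices)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element
        return ordered[mid]
    # Even count: mean of the two central elements
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [5, 87, 45, 32, 1]        # sorted: 1, 5, 32, 45, 87
even = [5, 87, 45, 32, 1, 38]   # sorted: 1, 5, 32, 38, 45, 87

print(median_by_position(odd), statistics.median(odd))    # 32 32
print(median_by_position(even), statistics.median(even))  # 35.0 35.0
```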
However, sometimes (depending on the number of elements) the point of our measure of position is not exactly in the middle of 2 elements. Sometimes it falls closer to one than the other, so we have to interpolate those values in order to find the weighted result. Firstly, let's use a quantile value $q$, which ranges from 0 to 1.
Having the quantile value, we can find the position value $p$ as:
$$ \large p = q \cdot (n - 1) $$
where $n$ is the number of elements in the set and $p$ is a position in the sorted data, counted from 0.
If $p$ falls between two consecutive positions $a$ and $b$, the interpolated value $I$ is:
$$ \large t = p - a \quad ; \quad I = (1 - t) \cdot v_a + t \cdot v_b $$
where $v_a$ and $v_b$ are the values of the elements at positions $a$ and $b$, respectively.
For example, given the list of 6 numbers: 5, 87, 45, 32, 1, 38
The next step is to sort all the elements: 1, 5, 32, 38, 45, 87
If we want to calculate the first quartile ($Q_1$ and $q=0.25$), the position is calculated as:
$$ \large \begin{align} p &= q \cdot (n - 1) \\ &= 0.25 \cdot (6 - 1) \\ &= 0.25 \cdot 5 \\ p &= 1.25 \end{align} $$
With $p = 1.25$, we know that the point lies between the elements at zero-based positions 1 ($a$) and 2 ($b$), whose values are 5 and 32, respectively. Thus:
$$ \large \begin{align} t &= p - a \\ &= 1.25 - 1 \\ t &= 0.25 \end{align} $$
$$ \large \begin{align} I &= (1 - t) \cdot v_a + t \cdot v_b \\ &= (1 - 0.25) \cdot 5 + 0.25 \cdot 32 \\ &= 0.75 \cdot 5 + 0.25 \cdot 32 \\ &= 3.75 + 8 \\ I &= 11.75 \end{align} $$
numbers = pd.Series([5, 87, 45, 32, 1, 38])
Q1 = numbers.quantile(0.25)
print(f'Q1 | First quartile is: {Q1}')
Q1 | First quartile is: 11.75
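The interpolation steps can also be written out by hand. This is a minimal sketch, assuming the weight $t$ is the fractional part of the position ($t = p - a$) and that positions are zero-based. The function name `interpolated_quantile` is hypothetical, not part of pandas.

```python
import math

def interpolated_quantile(values, q):
    """Linearly interpolated quantile for 0 <= q <= 1."""
    ordered = sorted(values)
    n = len(ordered)
    p = q * (n - 1)          # fractional zero-based position
    a = math.floor(p)        # position just below p
    b = min(a + 1, n - 1)    # position just above p (clamped at the last element)
    t = p - a                # how far p lies from a towards b
    return (1 - t) * ordered[a] + t * ordered[b]

numbers = [5, 87, 45, 32, 1, 38]
print(interpolated_quantile(numbers, 0.25))  # 11.75
print(interpolated_quantile(numbers, 0.50))  # 35.0
```

When $p$ is a whole number, $t = 0$ and the result is simply the element at that position, so no special case is needed.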
The InterQuartile Range (or IQR) tells us how spread out the “middle fifty” percent of a data set is. It is the difference between the third and the first quartiles:
$$ \large IQR = Q_3 - Q_1 $$
Despite being a measure of dispersion, this range also gives us useful summaries of the data, such as the InterQuartile Mean (or IQM), the mean of the values lying between $Q_1$ and $Q_3$. Such summaries are effective because they discard outliers, which are extreme observations that can distort our analysis.
$$ \large IQM = \frac{2}{n} \sum_{i=\frac{n}{4}+1}^{\frac{3n}{4}} x_i $$
where the $x_i$ are the sorted values and $n$ is assumed to be divisible by 4.
# Sequence with outliers
numbers = pd.Series([1, 5, 6, 7, 8, 9, 32])
Q1 = numbers.quantile(0.25)
Q3 = numbers.quantile(0.75)
# Keep only the values lying between Q1 and Q3 (inclusive)
middle = numbers[numbers.between(Q1, Q3)]
SM = numbers.mean() # Sample mean
IQM = middle.mean() # InterQuartile mean
print(f'SM is: {SM}')
print(f'IQM is: {IQM}')
SM is: 9.714285714285714
IQM is: 7.0
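The IQM formula can also be applied directly by index instead of filtering on $Q_1$ and $Q_3$. The sketch below assumes the number of elements is divisible by 4, which is the case the formula covers as written. The function name `interquartile_mean` and the 8-element data set are hypothetical, chosen only for illustration.

```python
def interquartile_mean(values):
    """IQM by index: mean of the middle half of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 4 != 0:
        raise ValueError("this sketch only covers n divisible by 4")
    # 1-based items n/4 + 1 .. 3n/4 are the zero-based slice [n/4 : 3n/4]
    middle = ordered[n // 4 : 3 * n // 4]
    return 2 / n * sum(middle)

data = [1, 5, 6, 7, 8, 9, 10, 32]      # illustrative data with one outlier
print(sum(data) / len(data))           # 9.75 (sample mean, pulled up by 32)
print(interquartile_mean(data))        # 7.5  (mean of 6, 7, 8, 9)
```

For other values of $n$ the quartile boundaries fall between elements, and the boundary observations need fractional weights, which this sketch deliberately leaves out.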
A box plot is a graphical representation of our measures of position. Its anatomy shows all the important values covered here: the median, the quartiles $Q_1$ and $Q_3$ (the edges of the box), and the whiskers.
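A very important point about the box plot is its minimum and maximum values. They don't represent the extreme values of the sample ($q=0$ and $q=1$). Instead, they are the ends of the whiskers, which by convention reach the most extreme observations lying within $1.5 \cdot IQR$ below $Q_1$ and above $Q_3$. Observations beyond the whiskers are drawn individually as outliers.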
# Draw 1000 samples from a normal distribution (mean 0, standard deviation 4)
np.random.seed(0)
numbers = pd.Series(np.random.normal(0, 4, 1000))
# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.boxplot(numbers)
plt.show()
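The whisker ends drawn in the plot can be reproduced numerically. This is a minimal sketch, assuming the common $1.5 \cdot IQR$ rule (matplotlib's default, `whis=1.5`) and using the small sequence with an outlier from earlier rather than the random sample, so the numbers are easy to check by hand.

```python
import statistics

data = [1, 5, 6, 7, 8, 9, 32]

# Quartiles with linear interpolation (same rule as the pandas default)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)                # 5.5 8.5 3.0
print(lower_fence, upper_fence)   # 1.0 13.0
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [32]
```

Here the upper whisker ends at 9, not at the sample maximum 32, which is why the box plot's "maximum" differs from the $q=1$ quantile.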