Measures of Position

Author: Diego Inácio
GitHub: github.com/diegoinacio
Notebook: measures_position.ipynb

Overview of measures of position in samples or distributions.

In descriptive statistics, measures of position are a way to know where a certain data point or range falls in a sample or population distribution. Most of them are used to summarize the data descriptively, and they help keep that summary from being sensitive to extreme observations, known as outliers.

[Figure: measures of position]

In [1]:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Quantiles

Quantiles are cut points that divide the sorted data (or the range of a probability distribution) into contiguous intervals with equal probabilities. A set of q-quantiles consists of the $q - 1$ cut points that mark the boundaries between $q$ consecutive subsets.

Some q-quantiles have special names, for example:

Median: the single cut point of the 2-quantiles. It is a well-known measure of central location and sits exactly in the middle of the sorted data, separating the bottom 50% from the top 50%;
Quartile: the 4-quantiles. There are 3 cut points [$Q_1$, $Q_2$, $Q_3$], which divide the sorted data into 4 equal parts;
Decile: the 10-quantiles. There are 9 cut points [$D_1$, $D_2$, ..., $D_9$], which divide the sorted data into 10 equal parts;
Percentile: the 100-quantiles. There are 99 cut points [$P_1$, $P_2$, ..., $P_{99}$], and each part contains 1% of the sorted data.
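As a rough illustration of the named quantiles listed above, the sketch below asks for the cut points of the 2-, 4-, 10- and 100-quantiles of the integers 0 to 100. It uses `statistics.quantiles` from the Python standard library, which is not part of the original notebook; the `method="inclusive"` option is assumed here to correspond to the linear interpolation that pandas applies by default.

```python
import statistics

# Integers from 0 to 100, so each cut point is easy to read off
data = list(range(101))

# n is the number of equal parts; the function returns the n - 1 cut points
median_cut = statistics.quantiles(data, n=2, method="inclusive")
quartiles = statistics.quantiles(data, n=4, method="inclusive")
deciles = statistics.quantiles(data, n=10, method="inclusive")
percentiles = statistics.quantiles(data, n=100, method="inclusive")

print(f"Median cut point: {median_cut}")
print(f"Quartiles:        {quartiles}")
print(f"Deciles:          {deciles}")
print(f"Number of percentile cut points: {len(percentiles)}")
```

Each call returns one fewer value than the number of parts: 1 cut point for the median, 3 for quartiles, 9 for deciles and 99 for percentiles.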

[Figure: measures of position, quartiles]

For example, given the list of 5 numbers: 5, 87, 45, 32, 1

The first step is to sort all the elements: 1, 5, 32, 45, 87

Finally, we can get the quartile points, which are:

$Q_1$: quantile 25%. element value is 5;
$Q_2$: quantile 50% (same as median). element value is 32;
$Q_3$: quantile 75%. element value is 45;

In [2]:

numbers = pd.Series([5, 87, 45, 32, 1])

Q1 = numbers.quantile(0.25)
Q3 = numbers.quantile(0.75)
Q2 = numbers.quantile(0.50) # same as median

print(f'Q1 |  First quartile is: {Q1}')
print(f'Q2 | Second quartile is: {Q2} (same as median)')
print(f'Q3 |  Third quartile is: {Q3}')

Notice that we can define any quantile by giving its relative position as a percentage:

$Q_1$: first quartile is quantile 25%;
$D_3$: third decile is quantile 30%;
$P_2$: second percentile is quantile 2%;

In [3]:

# Define range from 0-100
numbers = pd.Series(np.arange(101))

Q1 = numbers.quantile(0.25)
D3 = numbers.quantile(0.30)
P2 = numbers.quantile(0.02)

print(f'Q1 |    First quartile is: {Q1}')
print(f'D3 |      Third decile is: {D3}')
print(f'P2 | Second percentile is: {P2}')

Number of set elements

In the same way as in measures of central tendency, the number of set elements matters here. Take the median as an example: if the data has an odd number of elements, it is the middle element (the $\frac{n + 1}{2}$th element). If the data has an even number of elements, it is the mean of the two central elements (the $\frac{n}{2}$th and $\left[ \frac{n}{2} + 1 \right]$th).
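The odd and even cases for the median can be sketched in plain Python as below. The helper name `median_by_position` is hypothetical, and the two samples are the lists used elsewhere in this notebook; the result is compared against `statistics.median` from the standard library as a sanity check.

```python
import statistics

def median_by_position(values):
    """Median from element positions (hypothetical helper for illustration)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the single middle element (zero-based index n // 2)
        return ordered[n // 2]
    # Even n: mean of the two central elements (zero-based n // 2 - 1 and n // 2)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [5, 87, 45, 32, 1]       # sorted: 1, 5, 32, 45, 87
even_sample = [5, 87, 45, 32, 1, 38]  # sorted: 1, 5, 32, 38, 45, 87

print(f"Odd sample median:  {median_by_position(odd_sample)}")
print(f"Even sample median: {median_by_position(even_sample)}")
print(f"Matches statistics.median: "
      f"{median_by_position(odd_sample) == statistics.median(odd_sample)}")
```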

[Figure: measures of central location, median]

However, sometimes (depending on the number of elements) the point of our measure of position is not exactly in the middle of 2 elements. Sometimes it falls closer to one than the other, so we have to interpolate those values in order to find the weighted result. Firstly, let's use a quantile value $q$, which ranges from 0 to 1.

$Q_1$ is the first quartile and $q = 0.25$;
$Q_3$ is the third quartile and $q = 0.75$;
$D_4$ is the fourth decile and $q = 0.4$;
$P_8$ is the eighth percentile and $q = 0.08$;
$q = 0$ and $q = 1$ are the minimum and maximum values, respectively.

Having the quantile value, we can find the position value $p$ as:

$$ \large p = q \cdot (n - 1) $$

where $n$ is the number of elements in the set. Thus:

If $p = 0.25$ the point is between the elements 0 and 1;
If $p = 1.75$ the point is between the elements 1 and 2;
If $p = 5.01$ the point is between the elements 5 and 6;
If $p = 11.99$ the point is between the elements 11 and 12.
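The position formula and the neighbouring elements listed above can be checked with a few lines of Python. The function names `position` and `neighbours` are hypothetical and used only for this sketch; element indices are zero-based, as in the list above.

```python
import math

def position(q, n):
    """Zero-based fractional position p = q * (n - 1) of quantile q in n sorted elements."""
    return q * (n - 1)

def neighbours(p):
    """Indices of the two elements surrounding position p.

    When p is a whole number, both indices are equal: the point falls
    exactly on an element and no interpolation is needed.
    """
    return math.floor(p), math.ceil(p)

for p in (0.25, 1.75, 5.01, 11.99):
    a, b = neighbours(p)
    print(f"p = {p:>5} lies between elements {a} and {b}")

# First quartile of a set with 6 elements
print(f"p for q = 0.25 and n = 6: {position(0.25, 6)}")
```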

Considering that our value $p$ falls between two elements with indices $a$ and $b$ (with $a \leq p \leq b$), the interpolated value $I$ is:

$$ \large t = p - a \quad ; \quad I = (1 - t) \cdot v_a + t \cdot v_b $$

where $v_a$ and $v_b$ are the values of the elements $a$ and $b$, respectively.

For example, given the list of 6 numbers: 5, 87, 45, 32, 1, 38

The next step is to sort all the elements: 1, 5, 32, 38, 45, 87

If we want to calculate the first quartile ($Q_1$ and $q=0.25$), the position is calculated as:

$$ \large \begin{align} p &= q \cdot (n - 1) \\ &= 0.25 \cdot (6 -1) \\ &= 0.25 \cdot 5 \\ p &= 1.25 \end{align} $$

Having $p = 1.25$ we already know that the point is between the elements with index 1 (a) and 2 (b), which are 5 and 32, respectively. Thus:

$$ \large \begin{align} t &= p - a \\ &= 1.25 - 1 \\ t &= 0.25 \end{align} $$

$$ \large \begin{align} I &= (1 - t) \cdot v_a + t \cdot v_b \\ &= (1 - 0.25) \cdot 5 + 0.25 \cdot 32 \\ &= 0.75 \cdot 5 + 0.25 \cdot 32 \\ &= 3.75 + 8 \\ I &= 11.75 \end{align} $$

In [4]:

numbers = pd.Series([5, 87, 45, 32, 1, 38])

Q1 = numbers.quantile(0.25)

print(f'Q1 | First quartile is: {Q1}')
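To make the interpolation steps explicit, here is a minimal pure-Python sketch that reproduces the worked example by hand. The function name `quantile_linear` is hypothetical; the weight `t` is taken as the fractional part of the position `p`, so that `t = 0` returns the lower element and `t = 1` the upper one.

```python
import math

def quantile_linear(values, q):
    """Quantile q (0 <= q <= 1) by linear interpolation between neighbours.

    A sketch for illustration, not a replacement for pandas or numpy.
    """
    ordered = sorted(values)
    n = len(ordered)
    p = q * (n - 1)        # fractional zero-based position
    a = math.floor(p)      # index of the lower neighbour
    b = min(a + 1, n - 1)  # index of the upper neighbour (clamped at the end)
    t = p - a              # weight of the upper neighbour
    return (1 - t) * ordered[a] + t * ordered[b]

numbers = [5, 87, 45, 32, 1, 38]
q1 = quantile_linear(numbers, 0.25)
print(f"Q1 | First quartile is: {q1}")
```

For this list the sketch gives 11.75, the same value obtained in the derivation above.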

InterQuartile Range and Outliers

The InterQuartile Range (or IQR) tells us where the “middle fifty” is in a data set. It is basically the difference between the first and the third quartiles:

$$ \large IQR = Q_3 - Q_1 $$

Despite being a measure of dispersion, this range also gives rise to useful summaries of the data, such as the InterQuartile Mean (or IQM). Such summaries are effective because they discard outliers, which are extreme observations that can distort our analysis.

$$ \large IQM = \frac{2}{n} \sum_{i=\frac{n}{4}+1}^{\frac{3n}{4}} x_i $$

where $x_i$ is the $i$-th element of the sorted data and $n$ is assumed to be divisible by 4.

In [5]:

# Sequence with outliers
numbers = pd.Series([1, 5, 6, 7, 8, 9, 32])

Q1 = numbers.quantile(0.25)
Q3 = numbers.quantile(0.75)

# Keep only the values inside the interquartile range [Q1, Q3]
middle = numbers[(numbers >= Q1) & (numbers <= Q3)]

SM = numbers.mean()  # Sample mean
IQM = middle.mean()  # InterQuartile mean

print(f'SM is: {SM}')
print(f'IQM is: {IQM}')
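The cell above selects the values lying between $Q_1$ and $Q_3$. The IQM formula itself works by index instead: it drops the lowest and highest quarter of the sorted data and averages the rest. The sketch below follows the formula literally under the assumption that $n$ is divisible by 4; the function name `interquartile_mean` and the 8-element sample are made up for illustration.

```python
def interquartile_mean(values):
    """IQM = (2 / n) * sum of sorted elements n/4 + 1 .. 3n/4 (1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 4 != 0:
        # Other sizes need fractional weights at the edges, not covered here
        raise ValueError("this sketch assumes n is divisible by 4")
    # 1-based positions n/4 + 1 .. 3n/4 are the zero-based slice [n/4 : 3n/4]
    middle = ordered[n // 4 : 3 * n // 4]
    return 2 / n * sum(middle)

# Hypothetical sample with one low and one high extreme value
sample = [1, 5, 6, 7, 8, 9, 10, 32]
sample_mean = sum(sample) / len(sample)
iqm = interquartile_mean(sample)

print(f"SM is: {sample_mean}")
print(f"IQM is: {iqm}")
```

With this sample the extremes 1 and 32 fall outside the middle half, so they affect the sample mean but not the IQM.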

Box Plot

A box plot is a graphical representation of our measures of position. Its anatomy shows all the important values covered here:

[Figure: measures of position, box plot]

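An important point about the box plot concerns its minimum and maximum. They do not represent the extreme values of the sample ($q=0$ and $q=1$). Instead, they are the limits of the whiskers, computed from the quartiles and the IQR, and observations beyond these limits are drawn individually as outliers: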

minimum: $Q_1 - 1.5 \cdot IQR$
maximum: $Q_3 + 1.5 \cdot IQR$
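As a small numeric check of these limits, the sketch below computes the IQR and both limits for a short sequence and lists the observations that fall outside them. It uses `statistics.quantiles` from the standard library with `method="inclusive"`, which is assumed here to match the default linear interpolation of pandas; the variable names are illustrative only.

```python
import statistics

# Short sequence with one extreme observation
numbers = [1, 5, 6, 7, 8, 9, 32]

# Quartile cut points; "inclusive" interpolates at position q * (n - 1)
q1, q2, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations strictly outside the limits are treated as outliers
outliers = [x for x in numbers if x < lower_limit or x > upper_limit]
inside = [x for x in numbers if lower_limit <= x <= upper_limit]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Limits: [{lower_limit}, {upper_limit}]")
print(f"Outliers: {outliers}")
print(f"Most extreme values inside the limits: {min(inside)} and {max(inside)}")
```

Here the upper limit is 13.0, so 32 is flagged as an outlier, while 9 is the largest observation still inside the limits.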

In [6]:

# Random normal distribution
np.random.seed(0)
numbers = pd.Series(np.random.normal(0, 4, 1000))
 
# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.boxplot(numbers)
plt.show()