Regression Analysis
Regression analysis involves a set of techniques that aim to infer the relationship between a dependent variable and one or more independent variables. There is a variety of methods that can be applied in different situations or for distinct purposes, but here we are going to address some of the more common ones.
The main goal of this post is to give a quick description and implementation that is as clear and intuitive as possible. All the models will be implemented using only Python and NumPy, and no real dataset will be used (only synthetically generated data) for a better understanding and visualization of the results. Well, let's check it out.
You can access all the examples and codes in this post by visiting these three notebooks: Linear Regression, Logistic Regression and Polynomial Regression.
Linear Regression
Linear regression can be understood as a statistical analysis process that infers the linear relationship between a dependent variable and one or more independent variables. One way to measure the degree of dependence between X and Y is the correlation coefficient $\large \rho_{XY}$, which is defined by:

$$\large \rho_{XY} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

Where:
- $\large \mu$ is the expected value of a random variable;
- $\large \sigma$ is the standard deviation of a random variable;
- $\large E$ is the expected value operator;
- $\large cov$ is the covariance.
As an example, let's compare the correlation measure applied to random datasets. The result ranges from -1 to 1, such that values around 0 indicate a weak correlation and values close to -1 or 1 indicate a strong correlation. Negative values indicate a negative correlation (decreasing slope).
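As a sketch of the definition above, the coefficient can be computed directly with NumPy (the function name and test data are illustrative):

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation coefficient: cov(X, Y) / (sigma_X * sigma_Y)."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

x = np.arange(10, dtype=float)
print(correlation(x, 2 * x + 1))   # perfectly increasing: rho close to 1
print(correlation(x, -3 * x))      # perfectly decreasing: rho close to -1
```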
Simple Linear Regression
A simple linear regression operates on two-dimensional sample points, where one dimension represents the dependent variable $\large y$ and the other represents the independent variable $\large x$, analytically described by:

$$\large y = mx + b$$

Where $\large m$ describes the angular coefficient (slope) and $\large b$ the linear coefficient (intercept). Using the least squares method, the coefficients can be estimated by:

$$\large m = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2} \qquad b = \overline{y} - m\overline{x}$$

Where $\large \overline{y}$ and $\large \overline{x}$ are the mean values of $\large y$ and $\large x$, respectively.
Having that, we can implement the linear regression model.
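A minimal sketch of such an implementation, using the closed-form least squares estimates of $m$ and $b$ (the class and variable names are illustrative):

```python
import numpy as np

class SimpleLinearRegression:
    """Least squares fit of y = m*x + b."""

    def fit(self, x, y):
        x_mean, y_mean = x.mean(), y.mean()
        # m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
        self.m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
        # b = y_bar - m * x_bar
        self.b = y_mean - self.m * x_mean
        return self

    def predict(self, x):
        return self.m * x + self.b

# Synthetic data around the line y = 2.5x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 0.5, x.size)
model = SimpleLinearRegression().fit(x, y)
```

The fitted `model.m` and `model.b` should land close to the true coefficients 2.5 and 1.0.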
Using the same random datasets, we can compare the results of the linear analysis. The red line represents the linear relationship between the variables and where the new predictions would lie if we used this trained model with new independent values.
We can notice the residual lines by comparing the actual data points with where they would lie if they were predicted using our trained model. The collection of residuals provides us with a very important measure called the MSE (mean squared error), which is described by:

$$\large MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Where $\large Y_i$ is the actual dependent value of the data point and $\large \hat{Y}_i$ is the predicted value, using the same independent variable value as an input.
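The measure translates directly into a one-line NumPy function (a sketch; the argument names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# One residual of -1 over three points: MSE = 1/3
print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 4.0])))
```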
Multiple Linear Regression
A multiple linear regression performs basically the same as a simple linear regression model, but over n-dimensional sample points. This means that it has one dependent variable $\large y$ but multiple independent variables $\large x_n$, analytically described by:

$$\large y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + b$$

Where $\large m_n$ describes the angular coefficient of each independent variable and $\large b$ the linear coefficient.
Using a three-dimensional random dataset, we have a hyperplane that represents the linear relationship between the variables. In the same way as in the simple regression, this plane represents where the new predictions would lie if we used this trained model with new independent values.
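One way to sketch the multiple case is to append a column of ones to the design matrix and solve the least squares system with NumPy (the function name and data are illustrative, and `lstsq` is used here instead of a hand-rolled solver):

```python
import numpy as np

def fit_multiple(X, y):
    """Least squares coefficients for y = X @ m + b.

    A bias column of ones is appended, so the intercept b is the last weight.
    """
    A = np.column_stack([X, np.ones(len(X))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[:-1], coeffs[-1]  # (m_1..m_n, b)

# Noiseless plane y = 3*x1 - 2*x2 + 0.5, so the fit recovers it exactly
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5
m, b = fit_multiple(X, y)
```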
Gradient descent
The use of gradient descent in the linear regression solution is achieved by iteratively minimizing the error with respect to the angular coefficient m and the linear coefficient b, such that:

$$\large m := m - \alpha \frac{\partial E}{\partial m} \qquad b := b - \alpha \frac{\partial E}{\partial b}$$

Where $\large \alpha$ is the learning rate and $\large E$ is the error (MSE).
To perform the gradient descent as a function of the error, it is necessary to calculate the gradient vector $\large \nabla$ of the function, described by:

$$\large \nabla E = \begin{bmatrix} \frac{\partial E}{\partial m} \\[4pt] \frac{\partial E}{\partial b} \end{bmatrix} = \begin{bmatrix} -\frac{2}{n} \sum_{i=1}^{n} x_i \left(y_i - (m x_i + b)\right) \\[4pt] -\frac{2}{n} \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right) \end{bmatrix}$$
Below is a visualization of the iterative process of a gradient descent model fitting the linear relationship.
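The iterative process can be sketched as follows, applying the MSE gradient of $m$ and $b$ at each step (the learning rate, epoch count, and data are illustrative assumptions):

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=2000):
    """Fit y = m*x + b by iteratively descending the MSE gradient."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + b
        # Partial derivatives of the MSE with respect to m and b
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        db = (-2.0 / n) * np.sum(y - y_pred)
        m -= lr * dm
        b -= lr * db
    return m, b

# Noiseless line y = 1.5x + 2, so the descent should converge to it
x = np.linspace(0, 5, 50)
y = 1.5 * x + 2.0
m, b = gradient_descent(x, y)
```

Unlike the closed-form solution, the result here depends on the learning rate and the number of iterations; too large a rate diverges, too small a rate converges slowly.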
Logistic Regression
Logistic regression is a statistical process similar to linear regression, but it solves classification problems through a hypothesis about discrete values $\large y_i$, represented by:

$$\large h_\theta(x) = g(\theta^T x) \qquad g(z) = \frac{1}{1 + e^{-z}}$$

Where:
- $\large h_\theta(x)$ is the hypothesis;
- $\large g(z)$ is the logistic function or sigmoid;
- $\large \theta_i$ are the parameters (or weights).
Similarly to linear regression, logistic regression can be adjusted by gradient descent, so it is necessary to calculate the gradient of the sigmoid function, described by:

$$\large g'(z) = g(z)\left(1 - g(z)\right)$$
For the binary classification of a two-dimensional dataset, the line which describes the decision boundary is defined by:

$$\large \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \implies x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}$$
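Putting the pieces together, a hedged sketch of binary logistic regression trained by gradient descent (function names, hyperparameters, and the synthetic clusters are all illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Train theta by gradient descent; theta[0] is the bias theta_0."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend bias column
    theta = np.zeros(A.shape[1])
    for _ in range(epochs):
        h = sigmoid(A @ theta)             # hypothesis h_theta(x)
        grad = A.T @ (h - y) / len(y)      # gradient of the log loss
        theta -= lr * grad
    return theta

def predict(theta, X):
    A = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(A @ theta) >= 0.5).astype(int)

# Two well-separated 2D clusters labeled 0 and 1
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta = fit_logistic(X, y)
```

With separable clusters like these, the learned $\theta$ defines the boundary line from the equation above via $x_2 = -(\theta_0 + \theta_1 x_1)/\theta_2$.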
Polynomial Regression
When a dataset is not linearly related or linearly separable, linear regression or logistic regression does not provide the best fit. For example, given the function:
If we try to apply a linear regression over the resulting data, we can notice that the linear solution does not fit it very well.
Thus, a good option would be to use polynomial regression, which is a non-linear prediction model defined by:

$$\large \vec{y} = \mathbf{X} \vec{\beta} + \vec{\epsilon}$$
where $\large \mathbf{X}$ (or $\large \mathbf{V}$) is the Vandermonde matrix of the independent variable, parametrised by the maximum degree $\large m$, $\large \vec{y}$ is the response vector, $\large \vec{\beta}$ is the parameter vector and $\large \vec{\epsilon}$ is a random error vector. In the form of a system of linear equations, we have:

$$\large \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Using the Least Squares Method, the estimated coefficient vector is given by:

$$\large \hat{\vec{\beta}} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \vec{y}$$
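The estimator above can be sketched as a small class with a `degree` attribute (a hedged sketch; the class and method names here are assumptions, and `np.linalg.lstsq` stands in for the explicit matrix inverse for numerical stability):

```python
import numpy as np

class PolynomialRegression:
    """Least squares polynomial fit of maximum degree m."""

    def __init__(self, degree=3):
        self.degree = degree

    def _vandermonde(self, x):
        # Columns x^0, x^1, ..., x^m
        return np.vander(x, self.degree + 1, increasing=True)

    def fit(self, x, y):
        X = self._vandermonde(x)
        # beta_hat = (X^T X)^-1 X^T y, solved via lstsq
        self.beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, x):
        return self._vandermonde(x) @ self.beta

# Cubic data y = x^3 - 2x + 1, recovered exactly with degree 3
x = np.linspace(-2, 2, 40)
y = x**3 - 2 * x + 1
model = PolynomialRegression(degree=3).fit(x, y)
```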
Notice that our class has an attribute called degree, which is the maximum degree of our function $\large f(x)$. In our example it should be $\large m=3$.