Contributed by: Prashanth Ashok
What is Regression?

Regression is a statistical method that helps us analyze and understand the relationship between two or more variables of interest. The process adopted to perform regression analysis helps us understand which factors are important, which can be ignored, and how they influence each other. In regression, we normally have one dependent variable and one or more independent variables. Here we try to "regress" the value of the dependent variable "Y" with the help of the independent variables. In other words, we are trying to understand how the value of 'Y' changes with respect to changes in 'X'. For regression analysis to be a successful method, we need to understand the following terms:
What is Regression Analysis?

Regression analysis is used for prediction and forecasting, and has substantial overlap with the field of machine learning. This statistical method is used across different industries, such as:
If you want to learn everything there is to know about Excel Regression Analysis, you can take up an online course. You'll learn how to use regression analysis to predict future trends, understand data, and make better decisions.

Terminologies used in Regression Analysis

Outliers

Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations, i.e. it does not belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often distorts the results we get.

Multicollinearity

When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems when ranking variables based on their importance, and it makes it difficult to select the most important independent variables.

Heteroscedasticity

When the variation between the target variable and an independent variable is not constant, it is called heteroscedasticity. Example: as one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

Underfit and Overfit

When we use unnecessary explanatory variables, it may lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set. It is also known as the problem of high variance. When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. This is also known as the problem of high bias.
Types of Regression

For different types of regression analysis, there are assumptions that need to be considered, along with understanding the nature of the variables and their distribution.
Linear Regression

The simplest of all regression types is Linear Regression, which tries to establish a relationship between independent and dependent variables. The dependent variable considered here is always a continuous variable.

What is Linear Regression?

Linear Regression is a predictive model used for finding the linear relationship between a dependent variable and one or more independent variables. Here, 'Y' is our dependent variable, which is continuous and numerical, and we are trying to understand how 'Y' changes with 'X'. So, if we are supposed to answer a question such as "What will be the GRE score of a student whose CGPA is 8.32?", our go-to option should be linear regression.

Examples of independent (X) and dependent (Y) variables:

• X is rainfall and Y is crop yield
• X is advertising expense and Y is sales
• X is sales of goods and Y is GDP

If the relationship with the dependent variable involves a single independent variable, it is known as Simple Linear Regression:

X ---> Y

If there are multiple independent variables, it is called Multiple Linear Regression.

Simple Linear Regression Model

As the model is used to predict the dependent variable, the relationship between the variables can be written in the form:

Yi = β0 + β1 Xi + εi

Where:

• Yi – dependent variable
• β0 – intercept
• β1 – slope coefficient
• Xi – independent variable
• εi – random error term

The main factor considered as part of regression analysis is understanding the variance between the variables. For understanding the variance, we need to understand the measures of variation.
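The slope and intercept above can be estimated by ordinary least squares. Below is a minimal sketch using made-up numbers for the "advertising expense vs. sales" example (the data values are purely illustrative, not from the article):

```python
import numpy as np

# Hypothetical data: advertising expense (X) vs. sales (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Ordinary least squares estimates of the slope (beta1) and intercept (beta0)
x_mean, y_mean = X.mean(), Y.mean()
beta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

print(f"Y = {beta0:.2f} + {beta1:.2f} * X")
```

The fitted line can then be used to predict Y for any new value of X via `beta0 + beta1 * x_new`.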
The measures of variation are:

• SST = total sum of squares (total variation): measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares (explained variation): variation attributable to the relationship between X and Y
• SSE = error sum of squares (unexplained variation): variation in Y attributable to factors other than X

With all these factors taken into consideration, before we start assessing whether the model is doing well, we need to consider the assumptions of Linear Regression.

Assumptions:

Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it has 5 assumptions:

• Linearity: the relationship between X and the mean of Y is linear
• Independence: the errors are independent of each other
• Homoscedasticity: the variance of the residuals is constant for all values of X
• Normality: the residuals are normally distributed
• No multicollinearity: the independent variables are not highly correlated with each other
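The variance decomposition above (SST = SSR + SSE) can be verified numerically. The following sketch fits a line to made-up data and computes all three quantities; the numbers are illustrative assumptions, not from the article:

```python
import numpy as np

# Hypothetical data and an OLS fit
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SST = np.sum((Y - Y.mean()) ** 2)      # total variation
SSR = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation
SSE = np.sum((Y - Y_hat) ** 2)         # unexplained variation

# For an OLS fit, SST = SSR + SSE (up to floating-point error)
print(SST, SSR, SSE)
```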
With these assumptions considered while building the model, we can build the model and make our predictions for the dependent variable. For any type of machine learning model, we need to check that the variables considered for the model are correct, using a suitable metric. In the case of regression analysis, the statistical measure that evaluates the model is called the coefficient of determination, represented as r². The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. The higher the value of r², the better the model is with the independent variables being considered.

r² = SSR / SST

Note: the value of r² lies in the range 0 ≤ r² ≤ 1.

Polynomial Regression

This type of regression technique is used to model nonlinear relationships by taking polynomial functions of the independent variables. In the figure given below, you can see that the red curve fits the data better than the green curve. Hence, in situations where the relationship between the dependent and independent variables seems to be non-linear, we can deploy Polynomial Regression models. A polynomial of degree k in one variable is written as:

y = β0 + β1x + β2x^2 + … + βkx^k

Here we can create new features such as x^2, x^3, …, x^k and fit linear regression in a similar manner. In the case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e. X3 = X1 · X2. The main drawback of this type of regression model is that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting of the model.

Logistic Regression

Logistic Regression, also known as the logit or maximum-entropy classifier, is a supervised learning method for classification. It establishes a relationship between a dependent class variable and independent variables using regression. The dependent variable is categorical, i.e. it can take only integral values representing different classes.
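Polynomial regression is just linear regression on expanded features. A minimal sketch, with made-up roughly quadratic data (the values are illustrative assumptions): we build the feature columns 1, x, x^2 and solve the least-squares problem directly.

```python
import numpy as np

# Hypothetical nonlinear data: y is roughly x^2 plus noise
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([4.2, 1.1, 0.1, 0.9, 4.1, 9.2])

# Polynomial features of degree 2: [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Fit by ordinary least squares — i.e. linear regression on the new features
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```

The same idea extends to higher degrees, but as the article notes, fitting polynomials of unnecessarily high degree invites overfitting.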
The probabilities describing the possible outcomes of a query point are modelled using a logistic function. This model belongs to the family of discriminative classifiers, which rely on attributes that discriminate the classes well. The basic model is used when we have 2 classes of the dependent variable. When there are more than 2 classes, there is another regression method that helps us predict the target variable. There are two broad categories of Logistic Regression algorithms:
There are two types of Multinomial Logistic Regression:
Process Methodology

Logistic regression takes into consideration the different classes of the dependent variable and assigns a probability of the event happening to each row of information. These probabilities are found by assigning different weights to each independent variable based on the relationship between the variables. If the relationship between the variables is positive, then a positive weight is assigned, and in the case of an inverse relationship, a negative weight is assigned. As the model is mainly used to classify the target variable as either 0 or 1, the Sigmoid (logistic) function is applied to the linear combination of the independent variables to obtain these probabilities.

The Sigmoid function:

P(y = 1) = Sigmoid(Z) = 1 / (1 + e^(-z))
P(y = 0) = 1 - P(y = 1) = 1 - (1 / (1 + e^(-z))) = e^(-z) / (1 + e^(-z))

y = 1 if P(y = 1 | X) > 0.5, else y = 0, where the default probability cut-off is taken as 0.5. This method is also called the log-odds (logit) method.

Assumptions
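The sigmoid mapping and the 0.5 cut-off described above can be sketched in a few lines (the value of z here is a made-up example of the weighted sum of the independent variables):

```python
import math

def sigmoid(z):
    # P(y = 1) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# z stands for the linear combination of weights and features,
# e.g. z = w0 + w1*x1 + w2*x2; 0.8 is an illustrative value
z = 0.8
p1 = sigmoid(z)                # P(y = 1)
p0 = 1.0 - p1                  # P(y = 0)
label = 1 if p1 > 0.5 else 0   # default cut-off of 0.5

print(p1, p0, label)
```

Note that at z = 0 the sigmoid returns exactly 0.5, which is why the sign of z alone determines the predicted class under the default cut-off.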
Some examples where this model can be used for predictions.
Linear Discriminant Analysis (LDA)

Discriminant Analysis is used for classifying observations into a class or category based on the predictor (independent) variables of the data. Discriminant Analysis creates a model to predict future observations where the classes are known. LDA comes to our rescue in situations where logistic regression is unstable, such as when:
Working Process of the LDA Model

The LDA model uses Bayes' theorem to estimate probabilities. It makes predictions based on the probability that a new input belongs to each class; the class with the highest probability is taken as the output. The prediction is made simply by the use of Bayes' theorem, which estimates the probability of the output class given the input, using the prior probability of each class and the probability of the data belonging to that class.

Regularized Linear Models

This method is used to solve the problem of overfitting, which shows up as the model performing poorly on test data. It does so by adding a penalty term to the objective function, which shrinks the coefficients and reduces the model's variance (at the cost of some bias). Regularization is generally useful in the following situations:
L1 Loss Function or L1 Regularization

In L1 regularization we minimize the objective function with an added penalty term proportional to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. Lasso Regression (Least Absolute Shrinkage and Selection Operator) makes use of L1 regularization. The cost function for lasso regression is:

Min(||Y - X(theta)||^2 + λ||theta||)

λ is the hyperparameter, whose value is equal to the alpha in the Lasso function. Lasso is generally used when we have a large number of features, because it automatically performs feature selection.

L2 Loss Function or L2 Regularization

In L2 regularization we minimize the objective function with an added penalty term proportional to the sum of the squares of the coefficients. Ridge Regression, or shrinkage regression, makes use of L2 regularization. The cost function for ridge regression is:

Min(||Y - X(theta)||^2 + λ||theta||^2)

λ is the penalty term, and is denoted by the alpha parameter in the ridge function. So by changing the value of alpha, we are basically controlling the penalty term: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitudes of the coefficients are reduced. Ridge regression:

• shrinks the parameters, and is therefore mostly used to prevent multicollinearity
• reduces model complexity by coefficient shrinkage

The value of alpha is a hyperparameter of Ridge, which means it is not automatically learned by the model and instead has to be set manually.
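The shrinkage effect of the ridge penalty can be seen directly from the closed-form solution beta = (X'X + λI)^(-1) X'y. A minimal sketch with synthetic, deliberately collinear data (all values here are made-up assumptions for illustration):

```python
import numpy as np

# Synthetic data with two nearly collinear features — exactly the
# situation where ridge regression is useful
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^(-1) X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

small_penalty = ridge(X, y, 0.01)
large_penalty = ridge(X, y, 100.0)

# A bigger lambda (alpha) shrinks the coefficient magnitudes
print(np.abs(small_penalty).sum(), np.abs(large_penalty).sum())
```

Lasso has no such closed form (the absolute-value penalty is not differentiable at zero) and is fitted iteratively, e.g. by coordinate descent, which is what library implementations do.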
A combination of both the Lasso and Ridge regression methods gives rise to Elastic Net Regression, where the cost function is:

Min(||Y - X(theta)||^2 + λ1||theta|| + λ2||theta||^2)

What mistakes do people make when working with regression analysis?

When working with regression analysis, it is important to understand the problem statement properly. If the problem statement talks about forecasting a continuous value, we should probably use linear regression. If the problem statement talks about binary classification, we should use logistic regression. Similarly, depending on the problem statement, we need to evaluate all our regression models.

To learn more about such concepts, take up Data Science and Business Analytics certificate courses and upskill today. Learn with the help of online mentorship sessions and career assistance. If you have any queries, feel free to leave them in the comments below and we'll get back to you at the earliest.