Contributed by: Prashanth Ashok
What is Regression?

Regression is a statistical method that helps us analyze and understand the relationship between two or more variables of interest. The process adopted to perform regression analysis helps us understand which factors are important, which can be ignored, and how they influence each other. In regression, we normally have one dependent variable and one or more independent variables. Here we try to "regress" the value of the dependent variable "Y" with the help of the independent variables. In other words, we are trying to understand how the value of 'Y' changes with respect to changes in 'X'. For regression analysis to be a successful method, we need to understand the following terms:
What is Regression Analysis?

Regression analysis is used for prediction and forecasting, and has substantial overlap with the field of machine learning. This statistical method is used across different industries, such as:
If you want to learn everything there is to know about Excel Regression Analysis, you can take up an online course. You'll learn how to use regression analysis to predict future trends, understand data, and make better decisions.

Terminologies used in Regression Analysis

Outliers

Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations, i.e. it does not belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often distorts the results we get.

Multicollinearity

When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems when ranking variables based on their importance, and it makes it difficult to select the most important independent variables.

Heteroscedasticity

When the variation between the target variable and an independent variable is not constant, it is called heteroscedasticity. Example: as one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

Underfit and Overfit

When we use unnecessary explanatory variables, it may lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set. It is also known as the problem of high variance. When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. This is also known as the problem of high bias.
Types of Regression

For different types of regression analysis, there are assumptions that need to be considered, along with understanding the nature of the variables and their distribution.
Linear Regression

The simplest of all regression types is Linear Regression, which tries to establish a relationship between independent and dependent variables. The dependent variable considered here is always a continuous variable.

What is Linear Regression?

Linear Regression is a predictive model used for finding the linear relationship between a dependent variable and one or more independent variables. Here, 'Y' is our dependent variable, which is continuous and numerical, and we are trying to understand how 'Y' changes with 'X'. So, if we are supposed to answer a question such as "What will be the GRE score of a student whose CGPA is 8.32?", our go-to option should be linear regression.

Examples of independent (X) and dependent (Y) variables:

• X is rainfall and Y is crop yield
• X is advertising expense and Y is sales
• X is sales of goods and Y is GDP

If the relationship with the dependent variable involves a single independent variable, it is known as Simple Linear Regression:

X ---> Y

If there are multiple independent variables, it is called Multiple Linear Regression.

Simple Linear Regression Model

As the model is used to predict the dependent variable, the relationship between the variables can be written in the form:

Yi = β0 + β1 Xi + εi

Where:

• Yi – dependent variable
• β0 – intercept
• β1 – slope coefficient
• Xi – independent variable
• εi – random error term

The main factor considered as part of regression analysis is understanding the variance between the variables. For understanding the variance, we need to understand the measures of variation.
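The slope and intercept above can be estimated by ordinary least squares. Below is a minimal sketch using made-up numbers for the "advertising expense vs. sales" example (the data values are purely illustrative, not from the article):

```python
import numpy as np

# Hypothetical data: advertising expense (X) vs. sales (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Ordinary least squares estimates of the slope (beta1) and intercept (beta0)
x_mean, y_mean = X.mean(), Y.mean()
beta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

print(f"Y = {beta0:.2f} + {beta1:.2f} * X")
```

The fitted line can then be used to predict Y for any new value of X via `beta0 + beta1 * x_new`.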
The measures of variation are:

• SST = total sum of squares (total variation): measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares (explained variation): variation attributable to the relationship between X and Y
• SSE = error sum of squares (unexplained variation): variation in Y attributable to factors other than X

With all these factors taken into consideration, before we start assessing whether the model is doing well, we need to consider the assumptions of Linear Regression.

Assumptions:

Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it has 5 assumptions:

• Linearity: the relationship between X and the mean of Y is linear
• Independence: the errors are independent of each other
• Homoscedasticity: the variance of the residuals is constant for all values of X
• Normality: the residuals are normally distributed
• No multicollinearity: the independent variables are not highly correlated with each other
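The variance decomposition above (SST = SSR + SSE) can be verified numerically. The following sketch fits a line to made-up data and computes all three quantities; the numbers are illustrative assumptions, not from the article:

```python
import numpy as np

# Hypothetical data and an OLS fit
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SST = np.sum((Y - Y.mean()) ** 2)      # total variation
SSR = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation
SSE = np.sum((Y - Y_hat) ** 2)         # unexplained variation

# For an OLS fit, SST = SSR + SSE (up to floating-point error)
print(SST, SSR, SSE)
```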
With these assumptions considered while building the model, we can build the model and make our predictions for the dependent variable. For any type of machine learning model, we need to check that the variables considered for the model are correct, using a suitable metric. In the case of regression analysis, the statistical measure that evaluates the model is called the coefficient of determination, represented as r². The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. The higher the value of r², the better the model is with the independent variables being considered.

r² = SSR / SST

Note: the value of r² lies in the range 0 ≤ r² ≤ 1.

Polynomial Regression

This type of regression technique is used to model nonlinear relationships by taking polynomial functions of the independent variables. In the figure given below, you can see that the red curve fits the data better than the green curve. Hence, in situations where the relationship between the dependent and independent variables seems to be non-linear, we can deploy Polynomial Regression models. A polynomial of degree k in one variable is written as:

y = β0 + β1x + β2x^2 + … + βkx^k

Here we can create new features such as x^2, x^3, …, x^k and fit linear regression in a similar manner. In the case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e. X3 = X1 · X2. The main drawback of this type of regression model is that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting of the model.

Logistic Regression

Logistic Regression, also known as the logit or maximum-entropy classifier, is a supervised learning method for classification. It establishes a relationship between a dependent class variable and independent variables using regression. The dependent variable is categorical, i.e. it can take only integral values representing different classes.
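Polynomial regression is just linear regression on expanded features. A minimal sketch, with made-up roughly quadratic data (the values are illustrative assumptions): we build the feature columns 1, x, x^2 and solve the least-squares problem directly.

```python
import numpy as np

# Hypothetical nonlinear data: y is roughly x^2 plus noise
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([4.2, 1.1, 0.1, 0.9, 4.1, 9.2])

# Polynomial features of degree 2: [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Fit by ordinary least squares — i.e. linear regression on the new features
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```

The same idea extends to higher degrees, but as the article notes, fitting polynomials of unnecessarily high degree invites overfitting.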
The probabilities describing the possible outcomes of a query point are modelled using a logistic function. This model belongs to the family of discriminative classifiers, which rely on attributes that discriminate the classes well. The basic model is used when we have 2 classes of the dependent variable. When there are more than 2 classes, there is another regression method that helps us predict the target variable. There are two broad categories of Logistic Regression algorithms:
There are two types of Multinomial Logistic Regression:
Process Methodology

Logistic regression takes into consideration the different classes of the dependent variable and assigns a probability of the event happening to each row of information. These probabilities are found by assigning different weights to each independent variable based on the relationship between the variables. If the relationship between the variables is positive, then a positive weight is assigned, and in the case of an inverse relationship, a negative weight is assigned. As the model is mainly used to classify the target variable as either 0 or 1, the Sigmoid (logistic) function is applied to the linear combination of the independent variables to obtain these probabilities.

The Sigmoid function:

P(y = 1) = Sigmoid(Z) = 1 / (1 + e^(-z))
P(y = 0) = 1 - P(y = 1) = 1 - (1 / (1 + e^(-z))) = e^(-z) / (1 + e^(-z))

y = 1 if P(y = 1 | X) > 0.5, else y = 0, where the default probability cut-off is taken as 0.5. This method is also called the log-odds (logit) method.

Assumptions
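The sigmoid mapping and the 0.5 cut-off described above can be sketched in a few lines (the value of z here is a made-up example of the weighted sum of the independent variables):

```python
import math

def sigmoid(z):
    # P(y = 1) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# z stands for the linear combination of weights and features,
# e.g. z = w0 + w1*x1 + w2*x2; 0.8 is an illustrative value
z = 0.8
p1 = sigmoid(z)                # P(y = 1)
p0 = 1.0 - p1                  # P(y = 0)
label = 1 if p1 > 0.5 else 0   # default cut-off of 0.5

print(p1, p0, label)
```

Note that at z = 0 the sigmoid returns exactly 0.5, which is why the sign of z alone determines the predicted class under the default cut-off.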
Some examples where this model can be used for predictions.
Linear Discriminant Analysis (LDA)

Discriminant Analysis is used for classifying observations into a class or category based on the predictor (independent) variables of the data. Discriminant Analysis creates a model to predict future observations where the classes are known. LDA comes to our rescue in situations where logistic regression is unstable, such as when:
Working Process of the LDA Model

The LDA model uses Bayes' theorem to estimate probabilities. It makes predictions based on the probability that a new input belongs to each class; the class with the highest probability is taken as the output. The prediction is made simply by the use of Bayes' theorem, which estimates the probability of the output class given the input, using the prior probability of each class and the probability of the data belonging to that class.

Regularized Linear Models

This method is used to solve the problem of overfitting, which shows up as the model performing poorly on test data. It does so by adding a penalty term to the objective function, which shrinks the coefficients and reduces the model's variance (at the cost of some bias). Regularization is generally useful in the following situations:
L1 Loss Function or L1 Regularization

In L1 regularization we minimize the objective function with an added penalty term proportional to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. Lasso Regression (Least Absolute Shrinkage and Selection Operator) makes use of L1 regularization. The cost function for lasso regression is:

Min(||Y - X(theta)||^2 + λ||theta||)

λ is the hyperparameter, whose value is equal to the alpha in the Lasso function. Lasso is generally used when we have a large number of features, because it automatically performs feature selection.

L2 Loss Function or L2 Regularization

In L2 regularization we minimize the objective function with an added penalty term proportional to the sum of the squares of the coefficients. Ridge Regression, or shrinkage regression, makes use of L2 regularization. The cost function for ridge regression is:

Min(||Y - X(theta)||^2 + λ||theta||^2)

λ is the penalty term, and is denoted by the alpha parameter in the ridge function. So by changing the value of alpha, we are basically controlling the penalty term: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitudes of the coefficients are reduced. Ridge regression:

• shrinks the parameters, and is therefore mostly used to prevent multicollinearity
• reduces model complexity by coefficient shrinkage

The value of alpha is a hyperparameter of Ridge, which means it is not automatically learned by the model and instead has to be set manually.
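The shrinkage effect of the ridge penalty can be seen directly from the closed-form solution beta = (X'X + λI)^(-1) X'y. A minimal sketch with synthetic, deliberately collinear data (all values here are made-up assumptions for illustration):

```python
import numpy as np

# Synthetic data with two nearly collinear features — exactly the
# situation where ridge regression is useful
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^(-1) X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

small_penalty = ridge(X, y, 0.01)
large_penalty = ridge(X, y, 100.0)

# A bigger lambda (alpha) shrinks the coefficient magnitudes
print(np.abs(small_penalty).sum(), np.abs(large_penalty).sum())
```

Lasso has no such closed form (the absolute-value penalty is not differentiable at zero) and is fitted iteratively, e.g. by coordinate descent, which is what library implementations do.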
A combination of both the Lasso and Ridge regression methods gives rise to Elastic Net Regression, where the cost function is:

Min(||Y - X(theta)||^2 + λ1||theta|| + λ2||theta||^2)

What mistakes do people make when working with regression analysis?

When working with regression analysis, it is important to understand the problem statement properly. If the problem statement talks about forecasting a continuous value, we should probably use linear regression. If the problem statement talks about binary classification, we should use logistic regression. Similarly, depending on the problem statement, we need to evaluate all our regression models.

To learn more about such concepts, take up Data Science and Business Analytics certificate courses and upskill today. Learn with the help of online mentorship sessions and career assistance. If you have any queries, feel free to leave them in the comments below and we'll get back to you at the earliest.