Regression Analysis

WHAT IS REGRESSION ANALYSIS?

Regression analysis is a predictive modelling technique that investigates the relationship between a dependent variable (target, outcome, or response) and one or more independent variables (predictors or explanatory variables). For example, the relationship between rash driving and the number of road accidents a driver has can be studied through regression.

The key elements of building a regression model:

Number of independent variables (e.g. one predictor in simple regression, several in multiple regression)
Shape of the regression line (e.g. polynomial terms, nonlinear regression)
Type of dependent variable (e.g. generalized linear model, logistic regression, Cox proportional hazards regression)

LINEAR REGRESSION

Linear regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).

It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from given values of the predictor variable(s). Simple linear regression has only one independent variable, whereas multiple linear regression has more than one. (The two are sometimes loosely called univariate and multivariate regression, although in statistics "multivariate" more strictly describes models with more than one dependent variable.)

[Figure: example of a linear regression scatterplot of weight vs. height]

The model in the figure can be interpreted to mean that, on average, each one-inch increase in height is associated with a 3.9-pound increase in weight.

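The most common method used to fit a regression line is the least squares method. It calculates the best-fit line for the observed data by minimizing the sum of the squared errors, that is, the squared vertical distances between each data point and the line. Linear regression requires that the dependent variable is continuous. The usual inference (p-values, confidence intervals) additionally assumes that the errors, rather than the dependent variable itself, are approximately normally distributed.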
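The least squares fit for a single predictor has a closed-form solution, which the following minimal sketch computes using only the Python standard library. The function name fit_simple_ols and the height/weight values are hypothetical and invented for illustration; they are not the data behind the figure.

```python
def fit_simple_ols(x, y):
    """Return (a, b) that minimize the sum of squared errors of Y = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x  # the fitted line passes through (mean_x, mean_y)
    return a, b


# Hypothetical heights (inches) and weights (pounds), made up for this example.
heights = [60, 62, 64, 66, 68, 70, 72]
weights = [115, 120, 131, 139, 144, 155, 160]

a, b = fit_simple_ols(heights, weights)
residuals = [w - (a + b * h) for h, w in zip(heights, weights)]
sse = sum(r ** 2 for r in residuals)

print(f"intercept a = {a:.2f}, slope b = {b:.2f}, sum of squared errors = {sse:.2f}")
print(f"predicted weight at 65 inches: {a + b * 65:.1f} pounds")
```

With an intercept in the model, the residuals of a least squares fit always sum to zero, which is a quick sanity check on any implementation.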

A few key points about Linear Regression:

Fast and easy to model, and particularly useful when the relationship to be modeled is not extremely complex and you don't have a lot of data.
Very intuitive to understand and interpret.
Linear Regression is very sensitive to outliers.
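The sensitivity to outliers can be seen in a small experiment. The sketch below fits a least squares slope to ten hypothetical points that lie exactly on the line Y = 2*X, then refits after replacing a single value with an outlier. The data and the helper name slope are assumptions made for illustration.

```python
def slope(x, y):
    """Least squares slope of y on x (single predictor, with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


x = list(range(1, 11))             # 1, 2, ..., 10
y_clean = [2 * xi for xi in x]     # points exactly on Y = 2*X
y_outlier = y_clean[:-1] + [100]   # last value changed from 20 to 100

slope_clean = slope(x, y_clean)
slope_outlier = slope(x, y_outlier)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

Because the errors are squared, one extreme point at the edge of the X range pulls the line strongly toward itself. In this made-up example the fitted slope more than triples.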

LOGISTIC REGRESSION

odds = (probability of the event occurring) / (probability of the event not occurring)

odds ratio = (odds of developing the disease given exposure) / (odds of developing the disease given non-exposure)

Logistic regression is used to estimate the probability of an event (success versus failure). Use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No). The logistic regression model is built on the odds and the odds ratio.
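As a rough illustration of the odds and the odds ratio, the sketch below computes both from a 2x2 table of exposure by disease status. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def odds(probability):
    """Odds = probability of the event / probability of no event."""
    return probability / (1 - probability)


# Hypothetical 2x2 table (counts invented for illustration).
exposed_cases, exposed_noncases = 30, 70
unexposed_cases, unexposed_noncases = 10, 90

p_exposed = exposed_cases / (exposed_cases + exposed_noncases)
p_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)

odds_exposed = odds(p_exposed)
odds_unexposed = odds(p_unexposed)
odds_ratio = odds_exposed / odds_unexposed

print(f"odds of disease given exposure:     {odds_exposed:.3f}")
print(f"odds of disease given non-exposure: {odds_unexposed:.3f}")
print(f"odds ratio:                         {odds_ratio:.3f}")
```

The same odds ratio can be obtained directly from the counts as (30 * 90) / (70 * 10), the cross-product of the table.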

A few key points about Logistic Regression:

It is widely used for classification problems.
Logistic regression does not require a linear relationship between the dependent and independent variables, because it applies a non-linear log transformation to the odds (the logit). The log odds, rather than the outcome itself, are modeled as a linear function of the predictors.
The independent variables should not be highly correlated with each other (no multicollinearity).
The values of the dependent variable may also be ordinal (ordinal logistic regression) or multi-class (multinomial logistic regression).
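To make the link between probabilities and log odds concrete, here is a minimal sketch of a logistic regression with one predictor, fitted by plain gradient descent on the log-likelihood. This is a teaching sketch under stated assumptions, not how statistical packages fit the model (they typically use iteratively reweighted least squares). The data, learning rate, step count, and function names are all hypothetical.

```python
import math


def sigmoid(z):
    """Map log odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, steps=10000):
    """Fit log(odds) = a + b*x by gradient descent on the mean negative log-likelihood."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y):
            error = sigmoid(a + b * xi) - yi  # predicted probability minus outcome
            grad_a += error
            grad_b += error * xi
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b


def predict_probability(a, b, x_value):
    return sigmoid(a + b * x_value)


# Hypothetical data: exposure level and a binary outcome (1 = event occurred).
exposure = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
outcome = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_logistic(exposure, outcome)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"odds ratio per one-unit increase in exposure: {math.exp(b):.3f}")
print(f"predicted probability at exposure 3.0: {predict_probability(a, b, 3.0):.3f}")
```

The slope b is the change in log odds per one-unit increase in the predictor, so exp(b) is the corresponding odds ratio. The predicted probability is an S-shaped function of the predictor even though the log odds are linear in it.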

POLYNOMIAL REGRESSION

When we want a model that can handle non-linear data, we can use polynomial regression. In this technique, the best-fit line is not a straight line but a curve fitted to the data points. The model includes higher-order polynomial terms, for example Y = a + b1*X + b2*X^2 + ... + bk*X^k + e, where X^k is the k-th order polynomial term: k = 1 (linear), 2 (quadratic), 3 (cubic), and so on.

Polynomial regression requires that the dependent variable is continuous, and it is used only when the relationship between the independent and dependent variables is not a straight line.
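A polynomial regression is still a least squares fit: the powers of X are simply treated as separate predictors. The sketch below, which assumes nothing beyond the Python standard library, builds the normal equations for a chosen degree and solves them by Gaussian elimination. The function names and the data are hypothetical, and the normal-equations approach is chosen for readability rather than numerical robustness.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_polynomial(x, y, degree):
    """Least squares coefficients [a, b1, ..., bk] for Y = a + b1*X + ... + bk*X^k."""
    rows = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def sum_squared_errors(coefficients, x, y):
    predictions = [sum(c * xi ** p for p, c in enumerate(coefficients)) for xi in x]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))


# Hypothetical curved data, invented for this example.
x = [0, 1, 2, 3, 4, 5, 6]
y = [1.0, 2.1, 5.2, 9.8, 17.1, 26.0, 36.9]

for degree in (1, 2):
    coefficients = fit_polynomial(x, y, degree)
    print(degree, [round(c, 3) for c in coefficients],
          round(sum_squared_errors(coefficients, x, y), 3))
```

On the fitting data, a higher degree can never increase the sum of squared errors, so a lower error alone does not show that the higher-degree model is better.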

A few key points about Polynomial Regression:

It is more flexible than linear regression and can model some fairly complex relationships.
It requires careful design and some knowledge of the data in order to select the best exponents.
While there might be a temptation to fit a higher degree polynomial to get lower error, this can result in over-fitting. Always plot the relationships to see the fit and focus on making sure that the curve fits the nature of the problem.

MORE ADVANCED REGRESSION TYPES & TECHNIQUES

Generalized Linear Models

If your dependent variable is neither continuous and approximately normal (general linear model) nor binary (logistic regression), you will need to look into generalized linear model methods: first find a distribution that best describes your dependent variable, then fit a regression model with that distribution and its corresponding link function.

Gamma or inverse Gaussian, if the dependent variable comes from a skewed distribution (e.g. length of stay, income, cost).
Poisson or negative binomial, if the dependent variable consists of counts (e.g. number of medications, count of procedures).
Beta, if the dependent variable is a proportion or percentage bounded by 0 and 1 (e.g. proportion of feed intake, any measure expressed as a percentage).
Mixture distributions, if the distribution of the dependent variable is a composition of more than one distribution.
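To show what "a distribution and its corresponding link function" means in practice, the sketch below fits a Poisson regression with the log link, log(mean count) = a + b*x, to a single predictor using Newton's method (equivalent to Fisher scoring for this model). The data, function name, and iteration count are assumptions for illustration; in real analyses a statistics package would normally be used instead.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log(mu) = a + b*x for count data y by Newton's method."""
    a = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b = 0.0
    for _ in range(iterations):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries (negative Hessian).
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system for the Newton step.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical data: a predictor and an observed count per subject.
predictor = [0, 1, 2, 3, 4, 5]
counts = [1, 2, 2, 4, 6, 9]

a, b = fit_poisson(predictor, counts)
fitted = [math.exp(a + b * xi) for xi in predictor]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"multiplicative change in expected count per unit of x: {math.exp(b):.3f}")
print("fitted means:", [round(m, 2) for m in fitted])
```

Because of the log link, exp(b) is interpreted as a rate ratio: the factor by which the expected count is multiplied for each one-unit increase in the predictor.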

Ordinal/multinomial regression

The values of the dependent variable may also be ordinal (ordinal logistic regression) or multi-class (multinomial logistic regression).
An example of an ordinal outcome is a Likert scale (strongly agree, agree, undecided, disagree, strongly disagree), and an example of a multinomial outcome is an unordered set of categories (e.g. type of lesion).
These models are more difficult to interpret, so where possible you should try to combine the levels of the dependent variable to make ordinary (binary) logistic regression suitable.
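Combining levels can be as simple as mapping each response to 0 or 1 before modeling. The sketch below collapses the five Likert levels into a binary outcome. The cut point (agree or strongly agree versus everything else) is an assumption for illustration; the appropriate grouping depends on the research question, and collapsing levels discards information.

```python
LIKERT_LEVELS = [
    "strongly disagree", "disagree", "undecided", "agree", "strongly agree",
]

# Assumed grouping: any level of agreement counts as 1, everything else as 0.
AGREEMENT_LEVELS = {"agree", "strongly agree"}


def to_binary(response):
    """Collapse a five-level Likert response into a 0/1 outcome."""
    if response not in LIKERT_LEVELS:
        raise ValueError(f"unexpected response: {response!r}")
    return 1 if response in AGREEMENT_LEVELS else 0


# Hypothetical survey responses.
responses = ["agree", "undecided", "strongly agree", "disagree", "strongly disagree"]
binary_outcome = [to_binary(r) for r in responses]
print(binary_outcome)
```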

Variable selection methods under regression framework: Backward Elimination, Forward Selection, and Stepwise Regression

These methods select the independent variables through an automatic procedure. The aim is to maximize predictive power while minimizing the number of predictor variables. Some of the most commonly used model selection methods are:

Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
Forward selection starts with the most significant predictor in the model and adds the next most significant variable at each step.
Standard stepwise regression combines the two: at each step it adds or removes predictors as needed.
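The forward selection procedure can be sketched in a few dozen lines. The version below is a simplified illustration, not a reference implementation: it measures "significance" with a partial F statistic and uses a fixed F-to-enter threshold of 4.0, which is an assumed rule of thumb rather than an exact p-value cutoff. The data, variable names, and helper functions are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least squares fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    XtX = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


def forward_selection(predictors, y, f_to_enter=4.0):
    """Add one predictor per step while the best candidate's partial F exceeds f_to_enter."""
    n = len(y)
    selected = []
    current_rss = rss([], y)  # intercept-only model
    while len(selected) < len(predictors):
        remaining = [name for name in predictors if name not in selected]
        trial = {name: rss([predictors[s] for s in selected + [name]], y)
                 for name in remaining}
        best = min(trial, key=trial.get)
        df_resid = n - (len(selected) + 2)  # intercept + selected + candidate
        f_stat = (current_rss - trial[best]) / (max(trial[best], 1e-12) / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current_rss = trial[best]
    return selected


# Hypothetical data: y depends on x1 and x2 but not on "noise".
predictors = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
y = [0.9, 1.1, 3.1, 2.9, 4.1, 3.9, 5.9, 6.1]

selected = forward_selection(predictors, y)
print("selected predictors, in order of entry:", selected)
```

Backward elimination follows the same pattern in reverse: start from the full model and, at each step, drop the predictor whose removal increases the residual sum of squares the least, as long as its partial F statistic falls below a removal threshold.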

ADDITIONAL RESOURCES

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. Many analytics professionals know only the two or three types of regression that are commonly used in practice, such as linear and logistic regression. In fact, there are more than 10 types of regression algorithms designed for various types of analysis, each with its own purpose. Every analyst should know which form of regression to use for a given type and distribution of data.

15 Types of Regression You Should Know
For in-depth consideration of specific types of regression models, consult:
- Penn State Stat 501 Regression Methods
  - Lesson 15 on Logistic, Poisson & Nonlinear Regression is especially helpful
For software code and results interpretation, visit stats.idre.ucla.edu.
- Use the fourth column of the table to look up the regression type, ensuring that your data match the information in the first three columns. Regardless of the software you select from the table, the website provides information on the theory behind the model and on interpreting the software's output.

Acknowledging the SDBC

Please use the following text to acknowledge the CTSI Study Design and Biostatistics Center:

"This investigation was supported by Translational Research: Implementation, Analysis and Design (TRIAD), with funding in part from the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1TR004409. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health."

"This investigation was supported by the Study Design and Biostatistics Center (SDBC), with funding in part from the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1TR004409. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health."