# Abuses of regression

There are three common abuses of regression analysis.

1. Extrapolation
2. Generalization
3. Causation

## Extrapolation

Anyone who was dealing in the stock market, or even just following it, remembers the crash of March 2000. During 1997-1999, many investors thought they would double their money every year, so they started buying fancy cars and houses on credit and living the high life. Little did they know that the whole market was hyped on speculation and little economic sense. The Enron and MCI financial fiascos were soon to follow.

Let us see whether we could have safely extrapolated the NASDAQ index from past years. Below is a table of the NASDAQ index, S, as a function of the end-of-year number, t (Year 1 is the end of 1994, and Year 6 is the end of 1999).

Table 1. NASDAQ index as a function of year number.

| Year Number (t) | NASDAQ Index (S) |
|:---------------:|:----------------:|
| 1 (1994)        | 752              |
| 2 (1995)        | 1052             |
| 3 (1996)        | 1291             |
| 4 (1997)        | 1570             |
| 5 (1998)        | 2193             |
| 6 (1999)        | 4069             |

A relationship S = a₀ + a₁t + a₂t² between the NASDAQ index, S, and the year number, t, is developed using least-squares regression and is found to be

S = 168.14t² − 597.35t + 1361.8

The data is given for Years 1 through 6, and it is desired to calculate values for t > 6. This is extrapolation outside the model data. The error inherent in this model is shown in Table 2. Look at Years 7 and 8, which were not included in the regression data: the errors between the predicted and actual values are 119% and 277%, respectively.

Table 2. Predicted NASDAQ index and absolute relative true errors.

| Year Number (t) | NASDAQ Index (S) | Predicted Index | Absolute Relative True Error (%) |
|:---------------:|:----------------:|:---------------:|:--------------------------------:|
| 1 (1994)        | 752              | 933             | 24                               |
| 2 (1995)        | 1052             | 840             | 20                               |
| 3 (1996)        | 1291             | 1082            | 16                               |
| 4 (1997)        | 1570             | 1663            | 6                                |
| 5 (1998)        | 2193             | 2578            | 18                               |
| 6 (1999)        | 4069             | 3831            | 6                                |
| 7 (2000)        | 2471             | 5419            | 119                              |
| 8 (2001)        | 1951             | 7344            | 277                              |
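The fit and the extrapolation errors above can be reproduced in a few lines. This is a minimal sketch using NumPy's `polyfit` (my choice of tool, not necessarily what the original analysis used):

```python
# Quadratic least-squares fit to the NASDAQ data from Table 1,
# then extrapolation to Years 7 and 8 (2000 and 2001).
import numpy as np

t = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # year number
S = np.array([752, 1052, 1291, 1570, 2193, 4069.0])  # NASDAQ index

# polyfit returns coefficients from highest degree down: S = a2*t^2 + a1*t + a0
a2, a1, a0 = np.polyfit(t, S, 2)
print(f"S = {a2:.2f} t^2 {a1:+.2f} t {a0:+.1f}")

# Extrapolate beyond the regression data and compare with actual index values
actual = {7: 2471.0, 8: 1951.0}
for year, true_val in actual.items():
    pred = a2 * year**2 + a1 * year + a0
    err = abs(true_val - pred) / true_val * 100  # absolute relative true error
    print(f"Year {year}: predicted {pred:.0f}, actual {true_val:.0f}, error {err:.0f}%")
```

Within the data range the quadratic tracks the index reasonably well, but one or two years past it, the prediction is off by a factor of two to four.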

This illustration is not exaggerated, and it shows that careful use of any given model equation is always called for. At all times, it is imperative to know the domain of the independent variables for which a given equation is valid.

## Generalization

Generalization arises when unsupported or exaggerated claims are made. It is not often possible to measure all predictor variables relevant to a study. For example, a study of the behavior of men might have inadvertently restricted the survey to Caucasian men. Shall we then generalize the results as attributes of all men, irrespective of race? Such use of a regression equation is an abuse, since the limitations imposed by the data restrict the use of the prediction equations to Caucasian men.

## Misidentified causation

Finally, misidentification of causation is a classic abuse of regression analysis. Regression analysis can only aid in the confirmation or refutation of a causal model; the model must, however, have a theoretical basis. In a chemical reacting system in which two species react to form a product, the amount of product formed, or the amount of a reacting species, varies with time. Although a regression equation relating species concentration to time can be obtained, one cannot attribute time as the causal agent for the varying species concentration. Regression analysis cannot prove causality; it can only substantiate or contradict causal assumptions. Anything beyond this is an abuse of regression analysis.
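The chemical-kinetics point can be made concrete with a small sketch. Here I assume a first-order reaction with made-up values of the rate constant and initial concentration; the regression of ln(C) on time fits essentially perfectly, yet the fit itself says nothing about why the concentration changes:

```python
# First-order reaction A -> B: concentration decays as C(t) = C0 * exp(-k*t).
# A linear regression of ln(C) on t fits this data almost perfectly, but the
# fit cannot tell us that the reaction kinetics (not time) cause the change.
import numpy as np

k, C0 = 0.3, 2.0                 # assumed rate constant (1/s) and initial concentration
t = np.linspace(0.0, 10.0, 11)   # sampling times
C = C0 * np.exp(-k * t)          # species concentration at each time

# Linear least-squares fit: ln C = ln C0 - k*t
slope, intercept = np.polyfit(t, np.log(C), 1)
print(f"fitted k = {-slope:.3f}, fitted C0 = {np.exp(intercept):.3f}")
# The regression recovers k and C0 exactly here -- a perfect fit -- yet
# causation still rests on the kinetic model, not on the regression.
```

A high R² on concentration-versus-time data is evidence that the kinetic model is consistent with the data, nothing more.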

This post used textbook notes written by the author and Egwu Kalu, Professor of Chemical and Biomedical Engineering, FAMU, Tallahassee, FL.

____________________________________________________

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu

Subscribe to the blog via a reader or email to stay updated with this blog. Let the information follow you.

## Author: Autar Kaw

Autar Kaw (http://autarkaw.com) is a Professor of Mechanical Engineering at the University of South Florida. He has been at USF since 1987, the same year in which he received his Ph.D. in Engineering Mechanics from Clemson University. He is a recipient of the 2012 U.S. Professor of the Year Award. With major funding from NSF, he is the principal and managing contributor in developing the multiple award-winning online open courseware for an undergraduate course in Numerical Methods. The OpenCourseWare (nm.MathForCollege.com) annually receives 1,000,000+ page views, 1,000,000+ views of the YouTube audiovisual lectures, and 150,000+ page views at the NumericalMethodsGuy blog. His current research interests include engineering education research methods, adaptive learning, open courseware, massive open online courses, flipped classrooms, and learning strategies. He has written four textbooks and 80 refereed technical papers, and his opinion editorials have appeared in the St. Petersburg Times and Tampa Tribune.