There are three common abuses of regression analysis.
- Extrapolation
- Generalization
- Misidentification of causation
Extrapolation
If you were dealing in the stock market, or were even interested in it, you will remember the stock market crash of March 2000. During 1997-1999, many investors thought they would double their money every year; they started buying fancy cars and houses on credit and living the high life. Little did they know that the whole market was hyped on speculation and made little economic sense. The Enron and MCI financial fiascos were soon to follow.
Let us see whether we could have safely extrapolated the NASDAQ index from past years. Table 1 gives the NASDAQ index, S, as a function of the end-of-year number, t (Year 1 is the end of 1994, and Year 6 is the end of 1999).
Table 1 NASDAQ index as a function of year number.
| Year Number (t) | NASDAQ Index (S) |
| --- | --- |
| 1 (1994) | 752 |
| 2 (1995) | 1052 |
| 3 (1996) | 1291 |
| 4 (1997) | 1570 |
| 5 (1998) | 2193 |
| 6 (1999) | 4069 |
A relationship $S = a_{0} + a_{1}t + a_{2}t^{2}$ between the NASDAQ index, S, and the year number, t, is developed using least-squares regression and is found to be
$S = 168.14t^{2} - 597.35t + 1361.8$
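The quadratic fit can be reproduced from the six data points in Table 1. The sketch below is one possible way to do it, assuming the normal equations are solved by Gaussian elimination in plain Python; the function name `fit_quadratic` is illustrative and does not come from the original notes. It should return coefficients close to those quoted above, with small differences in the last digits due to rounding.

```python
# Data from Table 1: year number t and end-of-year NASDAQ index S.
t = [1, 2, 3, 4, 5, 6]
S = [752, 1052, 1291, 1570, 2193, 4069]

def fit_quadratic(x, y):
    """Least-squares fit y ~ a0 + a1*x + a2*x**2 via the normal equations."""
    n = 3
    # Normal equations: sum_j (sum_k x_k^(i+j)) * a_j = sum_k y_k * x_k^i.
    A = [[sum(xk ** (i + j) for xk in x) for j in range(n)] for i in range(n)]
    b = [sum(yk * xk ** i for xk, yk in zip(x, y)) for i in range(n)]
    # Augmented matrix [A | b] for Gaussian elimination with partial pivoting.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    a = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (M[i][n] - tail) / M[i][i]
    return a  # [a0, a1, a2]

a0, a1, a2 = fit_quadratic(t, S)
print(f"a0 = {a0:.1f}, a1 = {a1:.2f}, a2 = {a2:.2f}")
```

The normal equations are written out by hand here only to keep the example free of third-party libraries; a library polynomial-fitting routine would serve equally well.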
The data is given for Years 1 through 6, and it is desired to calculate the value for t > 6. This is extrapolation outside the range of the model data. The error inherent in this model is shown in Table 2. Look at Years 7 and 8, which were not included in the regression data: the error between the predicted and actual values is 119% and 277%, respectively.
Table 2 Actual and predicted NASDAQ index as a function of year number.
| Year Number (t) | NASDAQ Index (S) | Predicted Index | Absolute Relative True Error (%) |
| --- | --- | --- | --- |
| 1 (1994) | 752 | 933 | 24 |
| 2 (1995) | 1052 | 840 | 20 |
| 3 (1996) | 1291 | 1082 | 16 |
| 4 (1997) | 1570 | 1663 | 6 |
| 5 (1998) | 2193 | 2578 | 18 |
| 6 (1999) | 4069 | 3831 | 6 |
| 7 (2000) | 2471 | 5419 | 119 |
| 8 (2001) | 1951 | 7344 | 277 |
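The predicted values and errors in Table 2 can be checked with a few lines of code. The following is a minimal sketch, assuming the absolute relative true error is defined as |true − predicted| / |true| × 100; the function names `predicted_index` and `abs_rel_true_error` are illustrative only.

```python
# Actual end-of-year NASDAQ index values from Table 2 (Years 1 through 8).
actual = {1: 752, 2: 1052, 3: 1291, 4: 1570, 5: 2193, 6: 4069, 7: 2471, 8: 1951}

def predicted_index(t):
    """Quadratic regression model quoted in the text (fitted on Years 1 to 6)."""
    return 168.14 * t**2 - 597.35 * t + 1361.8

def abs_rel_true_error(true_value, approx_value):
    """Absolute relative true error, in percent (assumed definition)."""
    return abs(true_value - approx_value) / abs(true_value) * 100.0

for t, s in actual.items():
    p = predicted_index(t)
    # Years 1 to 6 were used in the fit; Years 7 and 8 are extrapolated.
    region = "fit" if t <= 6 else "extrapolated"
    err = abs_rel_true_error(s, p)
    print(f"t={t}  actual={s}  predicted={p:.0f}  error={err:.0f}%  ({region})")
```

Because the coefficients are rounded, a computed entry may differ from the table by a unit in the last digit. The pattern is what matters: the errors stay below 25% inside the fitted range and exceed 100% as soon as the model is pushed past Year 6.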
This illustration is not exaggerated: any model equation must be used with care. It is imperative to establish the domain of the independent variables for which a given equation is valid, and to use the equation only within that domain.
Generalization
Generalization arises when unsupported or exaggerated claims are made. It is often not possible to measure all the predictor variables relevant to a study. For example, a study of the behavior of men might have inadvertently restricted its survey to Caucasian men. Should we then generalize the results as attributes of all men, irrespective of race? Such use of a regression equation is an abuse, since the limitations imposed by the data restrict the use of the prediction equations to Caucasian men.
Misidentification
Finally, misidentification of causation is a classic abuse of regression equations. Regression analysis can only aid in the confirmation or refutation of a causal model; the model must, however, have a theoretical basis. In a chemically reacting system in which two species react to form a product, the amount of product formed and the amounts of the reacting species vary with time. Although a regression equation relating species concentration to time can be obtained, one cannot identify time as the causal agent for the varying species concentration. Regression analysis cannot prove causality; it can only substantiate or contradict causal assumptions. Anything beyond this is an abuse of the regression analysis method.
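The reacting-system example can be made concrete with synthetic numbers. The sketch below assumes a hypothetical first-order decay, C(t) = C0·exp(−kt), with made-up sampling times and rate constants; none of these values come from the original notes. It regresses ln C on time for two different rate constants.

```python
import math

def concentration(t, k, c0=1.0):
    """Hypothetical first-order decay: C(t) = c0 * exp(-k * t)."""
    return c0 * math.exp(-k * t)

def linear_fit(x, y):
    """Least-squares line y ~ b0 + b1*x; returns (b0, b1, r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1, sxy / math.sqrt(sxx * syy)

times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical sampling times
for k in (0.2, 0.8):                     # hypothetical rate constants
    ln_c = [math.log(concentration(t, k)) for t in times]
    b0, slope, r = linear_fit(times, ln_c)
    print(f"k={k}: slope={slope:.3f}, r={r:.3f}")
```

In both runs, time correlates essentially perfectly with ln C, yet the same times give different concentrations because the assumed rate constant differs. The fit describes how concentration varies with time; it says nothing about what drives the change, which in this sketch is the reaction kinetics.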
This post used textbook notes written by the author and Egwu Kalu, Professor of Chemical and Biomedical Engineering, FAMU, Tallahassee, FL.
____________________________________________________
This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu