## Experiment for spline interpolation and integration

Background:

The motivation behind the experiment is to understand spline interpolation and numerical integration by finding the volume of water that can be held by a champagne glass.

What does the student do in the lab:

The student chooses one of the odd-shaped champagne glasses (Figure 1). The student measures the outer radius of the champagne glass at different known locations along the height. The student measures the thickness of the glass, so that he/she will be able to find the inner radius of the champagne glass at the locations he/she measured the outer radius. The student pours water to the brim in the champagne glass and checks how much volume the champagne glass holds.

Exercises assigned to the students:
Use MATLAB to solve problems. Use comments, display commands and fprintf statements, sensible variable names and units to explain your work. Staple all the work in the following sequence. Use USCS system of units throughout.

1. Attach the data sheet on which you collected the data in class.
2. Find the spline interpolant that curve fits the radius vs height data.
3. Show the individual points and the spline interpolant of radius vs height on a single plot.
4. Find how much volume of water the champagne glass would hold.
5. Compare the above result from problem#4 to the actual volume.
6. In 100-200 words, type out your conclusions using a word processor. Any formulas should be shown using an equation editor. Any sketches need to be drawn using a drawing software such as Word Drawing. Any plots can be imported from MATLAB.

What materials do you need; where do I buy it; how much do the materials costs?

1. Champagne Glasses: These glasses, called the Hurricane Plastic Glasses, are available at www.poolsidepineapple.com, part nos. HUR-105, HUR-106, YAR-114. We used glasses made of plastic to avoid breakage. http:/www.poolsidepineapple.com/cart_pages/shopping%20page%20tropical.htm. You can try other places to buy the champagne glasses. About $40 or so for about six pieces including S&H. Better yet, go to a cruise and get souvenir glasses. Whenever you do the experiment, you will remember the good times. 2. Graduated Cylinder: The graduated cylinder is available at http://scientificsonline.com/, part number 3036286. The cost of the cylinder is$20+S&H.
3. Vernier Caliper: The caliper is available at http://mcmaster.com part number 20265A49. The cost of the vernier caliper is $60+S&H. 4. Scale: Need to buy a thin scale for this. Any art-supplies store for$2 or so.

_____________________________________________________

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu

## Abuses of regression

There are three common abuses of regression analysis.

1. Extrapolation
2. Generalization
3. Causation.

Extrapolation

If you were dealing in the stock market or even interested in it, we remember the stock market crash of March 2000. During 1997-1999, many investors thought they would double their money every year, started buying fancy cars and houses on credit, and living the high life. Little did they know that the whole market was hyped on speculation and little economic sense? Enron and MCI financial fiascos were soon to follow.

Let us look if we could have safely extrapolated NASDAQ index from past years. Below is the table of NASDAQ index, S as a function of end of year number, t (Year 1 is the end of year 1994, and Year 6 is the end of year 1999).

Table 1 NASDAQ index as a function of year number.

 Year Number (t) NASDAQ Index (S) 1 (1994) 752 2 (1995) 1052 3 (1996) 1291 4 (1997) 1570 5 (1998) 2193 6 (1999) 4069

A relationship S = a0+a1t+a2t2 between the NASDAQ index, S and the year number, t is developed using least square regression and is found to be

S=168.14t2 – 597.35t + 1361.8

The data is given for Years 1 thru 6 and it is desired to calculate the value for t>=6. This is extrapolation outside the model data. The error inherent in this model is shown in Table 2. Look at the Year 7 and 8 that was not included in the regression data – the error between the predicted and actual values is 119% and 277%, respectively.

Table 2 NASDAQ index as a function of year number.

 Year Number (t) NASDAQ Index (S) Predicted Index Absolute Relative True Error (%) 1 (1994) 752 933 24 2 (1995) 1052 840 20 3 (1996) 1291 1082 16 4 (1997) 1570 1663 6 5 (1998) 2193 2578 18 6 (1999) 4069 3831 6 7 (2000) 2471 5419 119 8 (2001) 1951 7344 277

This illustration is not exaggerated and it is important that a careful use of any given model equations is always called for. At all times, it is imperative to infer the domain of independent variables for which a given equation is valid.

Generalization

Generalization could arise when unsupported or overexaggerated claims are made. It is not often possible to measure all predictor variables relevant in a study. For example, a study carried out about the behavior of men might have inadvertently restricted the survey to Caucasian men. Shall we then generalize the result as the attributes of all men irrespective of race? Such use of regression equation is an abuse since the limitations imposed by the data restrict the use of the prediction equations to Caucasian men.

Misidentification

Finally, misidentification of causation is a classic abuse of regression analysis equations. Regression analysis can only aid in the confirmation or refutation of a causal model ‑ the model must however have a theoretical basis. In a chemical reacting system in which two species react to form a product, the amount of product formed or amount of reacting species vary with time. Although a regression equation of species concentration and time can be obtained, one cannot attribute time as the causal agent for the varying species concentration. Regression analysis cannot prove causality; rather they can only substantiate or contradict causal assumptions. Anything outside this is an abuse of the use of regression analysis method.

This post used textbook notes written by the author and Egwu Kalu, Professor of Chemical and Biomedical Engineering, FAMU, Tallahassee, FL.

____________________________________________________

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu

## How do you know that the least squares regression line is unique and corresponds to a minimum

We already know that using the criterion of either

1. minimizing sum of residuals OR
2. minimizing sum of the absolute value of residuals

is BAD as either of the criteria do not give a unique line. Visit these notes for an example where these criteria are shown to be inadequate.

So we use minimizing the sum of the squares of the residuals as the criterion. How can we show that this criterion gives a unique line?

The proof is given below as image files because the proof is equation intensive. I made a better resolution pdf file also.

_____________________________________________________

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu

## Finding the optimum polynomial order to use for regression

Many a times, you may not have the privilege or knowledge of the physics of the problem to dictate the type of regression model. You may want to fit the data to a polynomial. But then how do you choose what order of polynomial to use.

Do you choose based on the polynomial order for which the sum of the squares of the residuals, Sr is a minimum? If that were the case, we can always get Sr=0 if the polynomial order chosen is one less than the number of data points. In fact, it would be an exact match.

So what do we do? We choose the degree of polynomial for which the variance as computed by

Sr(m)/(n-m-1)

is a minimum or when there is no significant decrease in its value as the degree of polynomial is increased. In the above formula,

Sr(m) = sum of the square of the residuals for the mth order polynomial

n= number of data points

m=order of polynomial (so m+1 is the number of constants of the model)

Let’s look at an example where the coefficient of thermal expansion is given for a typical steel as a function of temperature. We want to relate the two using polynomial regression.

 Temperature Instantaneous Thermal Expansion oF 1E-06 in/(in oF) 80 6.47 40 6.24 0 6.00 -40 5.72 -80 5.43 -120 5.09 -160 4.72 -200 4.30 -240 3.83 -280 3.33 -320 2.76

If a first order polynomial is chosen, we get

$alpha=0.009147T+5.999$, with Sr=0.3138.

If a second order polynomial is chosen, we get

$alpha=-0.00001189T^2+0.006292T+6.015$ with Sr=0.003047.

Below is the table for the order of polynomial, the Sr value and the variance value, Sr(m)/(n-m-1)

 Order of polynomial, m Sr(m) Sr(m)/(n-m-1) 1 0.3138 0.03486 2 0.003047 0.0003808 3 0.0001916 0.000027371 4 0.0001566 0.0000261 5 0.0001541 0.00003082 6 0.0001300 0.000325

So what order of polynomial would you choose?

From the above table, and the figure below, it looks like the second or third order polynomial would be a good choice as very little change is taking place in the value of the variance after m=2.

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu

## Data for aluminum cylinder in iced water experiment

A colleague asked me what if he did not have time or resources to do the experiments that have been developed at University of South Florida (USF) for numerical methods. He asked if I could share the data taken at USF.

Why not – here is the data for the experiment where an aluminum cylinder is placed in iced water. This link also has the exercises that the students were asked to do.

The temperature vs time data is as follows: (0,23.3), (5,16.3), (10,13), (15,11.8), (20,11), (25,10.7), (30,9.6), (35,8.9), (40,8.4). Time is in seconds and temperature in Celcius. Other data needed is

Ambient temperature of iced water = 1.1oC

Diameter of cylinder = 44.57 mm

Length of cylinder = 105.47 mm

Density of aluminum = 2700 kg/m3

Specific heat of aluminum = 901 J/(kg-oC)

Thermal conductivity of aluminum = 240 W/(m-K)

Table 1. Coefficient of thermal expansion vs. temperature for aluminum (Data taken from http://www.llnl.gov/tid/lof/documents/pdf/322526.pdf by using mid values of temperatures at which CTE is reported)

 Temperature (oC) Coefficient of thermal expansion (μm/m/oC) -10 58 12.5 59 37.5 60 62.5 62 87.5 66 112.5 71

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu

## In regression, when is coefficient of determination zero

The coefficient of determination is a measure of how much of the original uncertainty in the data is explained by the regression model.

The coefficient of determination, $r^2$ is defined as

$r^2$=$\frac{S_t-S_r}{S_r}$

where

$S_t$ = sum of the square of the differences between the y values and the average value of y

$S_r$ = sum of the square of the residuals, the residual being the difference between the observed and predicted values from the regression curve.

The coefficient of determination varies between 0 and 1. The value of the coefficient of determination of zero means that no benefit is gained by doing regression. When can that be?

One case comes to mind right away – what if you have only one data point. For example, if I have only one student in my class and the class average is 80, I know just from the average of the class that the student’s score is 80. By regressing student score to the number of hours studied or to his GPA or to his gender would not be of any benefit. In this case, the value of the coefficient of determination is zero.

What if we have more than one data point? Is it possible to get the coefficient of determination to be zero?

The answer is yes. Look at the following data pairs (1,3), (3,-2), (5,4), (7,-5), (9,4.2), (11,3), (2,4). If one regresses this data to a general straight line

y=a+bx,

one gets the regression line to be

y=1.6

In fact, 1.6 is the average value of the given y values. Is this a coincidence? Because the regression line is the average of the y values, $S_t=S_r$, implying $r^2=0$

QUESTIONS

1. Given (1,3), (3,-2), (5,4), (7,a), (9,4.2), find the value of a that gives the coefficient of determination, $r^2=0$. Hint: Write the expression for $S_r$ for the regression line $y=mx+c$. We now have three unknowns, m, c and a. The three equations then are $\frac{\partial S_r} {\partial m} =0$, $\frac{\partial S_r} {\partial c} =0$ and $S_t=S_r$.
2. Show that if n data pairs $(x_1,y_1)......(x_n,y_n)$ are regressed to a straight line, and the regression straight line turns out to be a constant line, then the equation of the constant line is always y=average value of the y-values.

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://numericalmethods.eng.usf.edu