Practical Regression and ANOVA using R
An inference of causality is often desired, but this is usually too much to expect from observational data. See Figure 3. For example, we would observe a positive correlation between the shoe sizes and reading abilities of elementary school students, but this relationship is driven by a lurking variable — the age of the child.
In observational studies, it is important to adjust for the effects of possible confounding variables such as the Z shown in Figure 3. If such variables can be identified, then at least their effect can be accounted for in the interpretation.
Unfortunately, one can never be sure that all the relevant Z have been identified. What if all relevant variables have been measured? In other words, suppose there are no unidentified lurking variables. Even then the naive interpretation does not work: interpretation cannot be done separately for each variable.
This is a practical problem because it is not unusual for the predictor of interest, x1 in this example, to be mixed up in some way with other variables like x2. This too has problems. Often in practice, individual variables cannot be changed without changing others too. Furthermore, this interpretation requires the specification of the other variables - changing which other variables are included will change the interpretation.
Unfortunately, there is no simple solution. To amplify this point, we will consider the effect of pop75 on the savings rate in the savings dataset below; first, imagine that the data in Figure 3 are observed only within the ellipses. If the effect of Z is ignored, a strong positive correlation between X and Y is observed in all three cases.
In panel A, we see that when we allow for the effect of Z by observing the relationship between X and Y separately within each level of Z, the relationship remains a positive correlation.
In panel B, after allowing for Z, there is no correlation between X and Y, while in panel C, after allowing for Z, the relationship becomes a negative correlation. Returning to the savings data, it is perhaps surprising that pop75 is not significant in this model. However, pop75 is negatively correlated with pop15, since countries with proportionately more younger people are likely to have relatively fewer older ones, and vice versa.
These two variables are both measuring the nature of the age distribution in a country. When two variables that represent roughly the same thing are included in a regression equation, it is not unusual for one or even both of them to appear insignificant even though prior knowledge about the effects of these variables might lead one to expect them to be important.
Higher values of these variables are both associated with wealthier countries. Compare the coefficients and p-values for pop75 throughout. Notice how the sign and significance change in Table 3. We see that no simple conclusion about the effect of pop75 is possible. We must find interpretations for a variety of models. In observational studies, there are steps one can take to make a stronger case for causality:
1. Try to include all relevant variables.
2. Use non-statistical knowledge of the physical nature of the relationship.
3. Try a variety of models - see if a similar effect is observed.
4. Multiple studies under different conditions can help confirm a relationship.
It was many years before other plausible explanations were eliminated. The news media often jump on the results of a single study, but one should be suspicious of these one-off results.
Publication bias is a problem. Another source of bias is that researchers have a vested interest in obtaining a positive result. There is often more than one way to analyze the data and the researchers may be tempted to pick the one that gives them the results they want. This is not overtly dishonest but it does lead to a bias towards positive results. The history of the study of the link between smoking and lung cancer shows that it takes a great deal of effort to progress beyond the observation of an association to strong evidence of causation.
An alternative approach is to recognize that the parameters and their estimates are fictional quantities in most regression situations. Instead, concentrate on predicting future values - these may actually be observed, and success can then be measured in terms of how good the predictions were. Notice that I have been careful not to say that we have taken a specific individual and increased their x1 by 1; rather, we have observed a new individual with predictor x1 + 1.
Prediction is conceptually simpler since interpretation is not an issue, but you do need to worry about extrapolation. Quantitative extrapolation: Is the new x0 within the range of validity of the model?
Is it close to the range of the original data? If not, the prediction may be unrealistic. Confidence intervals for predictions get wider as we move away from the data.
However, this widening does not reflect the possibility that the structure of the model itself may change as we move into new territory. The uncertainty in the parameter estimates is allowed for, but not uncertainty about the model itself.
Qualitative extrapolation: Is the new x0 drawn from the same population from which the original sample was drawn? If the model was built in the past and is to be used for future predictions, we must make a difficult judgment as to whether conditions have remained constant enough for this to work. Prediction is a tricky business — perhaps the only thing worse than a prediction is no prediction at all.
The good Christian should beware of mathematicians and all those who make empty prophecies. The danger already exists that mathematicians have made a covenant with the devil to darken the spirit and confine man in the bonds of Hell.

A different kind of problem arises with errors in the predictors: what if the X we see is not the X used to generate Y? Broadly, if the variability in the errors of observation of X is small relative to the range of X, then we need not be concerned.
One should not confuse the errors in predictors with treating X as a random variable. For observational data, X could be regarded as a random variable, but the regression inference proceeds conditional on a fixed value for X.
We make the assumption that Y is generated conditional on the fixed value of X. Contrast this with the errors-in-predictors case, where the X we see is not the X that was used to generate Y. Here we generate some artificial data from a known model, so we know the true values of the parameters and we can tell how well we do: runif generates uniform random numbers and rnorm generates standard normal random numbers.
Because you will get different random numbers, your results will not exactly match mine if you try to duplicate this. What happens when we add some noise to the predictor? We can plot all this information in Figure 4. The regression lines for no measurement error, small error and large error are shown as solid, dotted and dashed lines respectively.
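A minimal sketch of this kind of simulation (the numbers and variable names are mine, so the output will not match any particular figure in the text):

    set.seed(123)
    x  <- 10 * runif(50)          # the true predictor
    y  <- x + rnorm(50)           # response generated from the true x
    g  <- lm(y ~ x)               # fit with no measurement error
    z1 <- x + rnorm(50)           # predictor observed with a small error
    z2 <- x + 5 * rnorm(50)       # predictor observed with a large error
    g1 <- lm(y ~ z1)
    g2 <- lm(y ~ z2)
    c(coef(g)[2], coef(g1)[2], coef(g2)[2])   # slopes shrink toward zero as the error grows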
Chapter 5: Generalized Least Squares

The data originally appeared in Longley. Fit a linear model. What do you notice? In data collected over time such as this, successive errors could be correlated. Our initial estimate is 0.
This process would be iterated until convergence. The nlme library contains a GLS fitting function. Weighted least squares (WLS) can be used in this situation: cases with low variability should get a high weight, and cases with high variability a low weight. Some examples: 1. The experiment was designed to test certain theories about the nature of the strong interaction.
The cross-section (crossx) variable is believed to be linearly related to the inverse of the energy (the energy variable has already been inverted). At each level of the momentum, a very large number of observations was taken so that it was possible to estimate the standard deviation of the response (sd) accurately. In the figure, the unweighted fit is shown as a dashed line.
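A hedged sketch of the weighted fit, assuming a data frame strongx with columns crossx, energy and sd as described above (e.g. from the book's data library):

    g  <- lm(crossx ~ energy, data = strongx)                    # unweighted fit
    gw <- lm(crossx ~ energy, weights = sd^-2, data = strongx)   # weights = 1/variance
    summary(gw)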
If the weights must themselves be estimated, the fit can be iterated: recompute the weights, refit, and continue until convergence. Also, how many degrees of freedom do we have? More details may be found in Carroll and Ruppert. An alternative approach is to model the variance and jointly estimate the regression and weighting parameters using likelihood-based methods. This can be implemented in R using the gls function in the nlme library.
Chapter 6: Testing for Lack of Fit

How can we tell if a model fits the data? An example can be seen in Figure 6. Recall from Section 3. Continuing with the same data as in the weighted least squares example, we test to see whether a linear model is adequate.
In this example, we know the variance almost exactly because each response value is the average of a large number of observations.
However, we would get essentially the same result as the following analysis. Now plot the data and the fitted regression line, shown as a solid line on Figure 6. Just because R² is large does not mean that you cannot do better. This seems clearly more appropriate than the linear model. We can do this if we have repeated values of y for one or more fixed values of x. They cannot just be repeated measurements on the same subject or unit.
Such repeated measures would only reveal the within-subject variability or the measurement error. This model is just the one-way anova model, if you are familiar with that. Comparing this model to the regression model amounts to the lack-of-fit test. This is usually the most convenient way to compute the test, but if you like we can then partition the RSS into that due to lack of fit and that due to the pure error, as in Table 6.
Because the model of interest represents a special case of the saturated model, where the saturated parameters satisfy the constraints of the model of interest, we can use the standard F-testing methodology. In the next example, specimens were submerged in sea water for 60 days and the weight loss due to corrosion was recorded in units of milligrams per square decimeter per day.
The data come from Draper and Smith. We now fit a model that reserves a parameter for each group of data with the same value of x; this is accomplished by declaring the predictor to be a factor, as sketched below.
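A hedged sketch of this comparison, assuming a data frame corrosion with predictor Fe and response loss as described above (the names come from the book's data library):

    g  <- lm(loss ~ Fe, data = corrosion)          # the model of interest
    gf <- lm(loss ~ factor(Fe), data = corrosion)  # saturated model: one parameter per x value
    anova(g, gf)                                   # F-test for lack of fit against pure error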
Before considering other models, I would first find out whether the replicates are genuine — perhaps the low pure-error SD can be explained by some correlation in the measurements. Another possible explanation is that an unmeasured third variable is causing the lack of fit. When there are replicates, it is impossible to get a perfect fit: even when a parameter is assigned to each group of x-values, the residual sum of squares will not be zero. For the factor model above, the R² is. These methods are good for detecting lack of fit, but if the null hypothesis is accepted, we cannot conclude that we have the true model.
After all, it may be that we just did not have enough data to detect the inadequacies of the model. All we can say is that the model is not contradicted by the data. When there are no replicates, it may be possible to group the responses for similar x but this is not straightforward. It is also possible to detect lack of fit by less formal, graphical methods. A more general question is how good a fit do you really want? By increasing the complexity of the model, it is possible to fit the data more closely.
By using as many parameters as data points, we can fit the data exactly. Very little is achieved by doing this since we learn nothing beyond the data itself and any predictions made using such a model will tend to have very high variance. The question of how complex a model to fit is difficult and fundamental. There is no plausible reason corrosion loss should suddenly drop at 1.
This is a consequence of overfitting the data. This illustrates the need not to become too focused on measures of fit like R². The first model we try may prove to be inadequate. Regression diagnostics are used to detect problems with the model and suggest improvements. This is a hands-on process. The leverages hi depend only on X — knowledge of y is required for a full interpretation.
Large values of hi are due to extreme values in X. We first make up a character vector of the country names using row.names. When you are done identifying points, click on the middle (or, if that is not available, the right) mouse button. I have identified Chile and Zambia on the plot. Now look at the leverage: we first extract the X-matrix using model.matrix.
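A sketch of these computations, assuming g is the savings regression fitted earlier:

    countries <- row.names(savings)        # character vector of country names
    x   <- model.matrix(g)                 # the X-matrix of the fitted model g
    lev <- hatvalues(g)                    # leverages h_i (diagonal of the hat matrix)
    sum(lev)                               # equals p, the number of parameters
    plot(lev, ylab = "Leverage")
    identify(1:nrow(x), lev, countries)    # click to label points interactively
    stud <- residuals(g) / (summary(g)$sigma * sqrt(1 - lev))   # studentized residuals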
Which countries have large leverage? Alternatively, we can do it interactively like this: identify(1:50, lev, countries). I have identified Libya and the United States as the points with the highest leverage. If there is some underlying heteroscedasticity in the errors, studentization cannot correct for it. Which residuals are large? In this case, there is not much difference between the studentized and raw residuals apart from the scale. Only when there is unusually large leverage will the differences be noticeable.
We need to be aware of such exceptions. An outlier test is useful because it enables us to distinguish between truly unusual points and residuals which are large but not exceptional. Outliers may affect the fit — see Figure 7. The two additional marked points both have high leverage because they are far from the rest of the data, but only when we compute the fit without such a point do we get a large residual. The solid line is the fit including one of the added points but not the other, the dotted line is the fit without either added point, and the dashed line is the fit including the other added point only.
How large is large? Since ti follows a t distribution with n - p - 1 degrees of freedom, we can calculate a p-value to test whether case i is an outlier. However, we are likely to want to test all cases, so we must adjust the level of the test accordingly; a sketch of the adjusted computation is given below.
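A sketch of the adjusted test for a fitted model g (names are mine):

    jack <- rstudent(g)                          # jackknife (externally studentized) residuals
    n <- nrow(model.matrix(g))
    p <- ncol(model.matrix(g))
    jack[which.max(abs(jack))]                   # the most extreme case
    qt(1 - 0.05/(2 * n), df = n - p - 1)         # Bonferroni-adjusted critical value at level 0.05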
This method is called the Bonferroni correction and is used in contexts other than outliers as well. The larger n is, the more conservative it gets. Notes:
1. Two or more outliers next to each other can hide each other.
2. An outlier in one model may not be an outlier in another when the variables have been changed or transformed. You will usually need to reinvestigate the question of outliers when you change the model.
3. The error distribution may not be normal and so larger residuals may be expected. For example, day-to-day changes in stock indices seem mostly normal, but large changes occur not infrequently.
4. Individual outliers are usually much less of a problem in larger datasets. For large datasets, we need only worry about clusters of outliers.
Such clusters are less likely to occur by chance and more likely to represent actual structure. Finding these clusters is not always easy. What should be done about outliers?
1. Check for a data entry error first. These are relatively common. Unfortunately, the original source of the data may have been lost.
2. Examine the physical context - why did it happen? Sometimes, the discovery of an outlier may be of singular interest. Some scientific discoveries spring from noticing unexpected aberrations. Another example of the importance of outliers is in the statistical analysis of credit card transactions, where outliers may represent fraudulent use.
3. Exclude the point from the analysis, but try reincluding it later if the model is changed. The exclusion of one or more points may make the difference between getting a statistically significant result or having some unpublishable research.
This can lead to difficult decisions about what exclusions are reasonable. To avoid any suggestion of dishonesty, always report the existence of outliers even if you do not include them in your final model. NASA launched the Nimbus 7 satellite to record atmospheric information. After several years of operation, in 1985, the British Antarctic Survey observed a large decrease in atmospheric ozone over the Antarctic. On further examination of the NASA data, it was found that the data processing program automatically discarded observations that were extremely low and assumed to be mistakes.
Thus the discovery of the Antarctic ozone hole was delayed several years. Perhaps, if this had been known earlier, the CFC phaseout would have been agreed earlier and the damage could have been limited.
Here is an example of a dataset with multiple outliers. Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1, which is in the direction of Cygnus. Is this what you expected? Are there any outliers in the data? The outlier test does not reveal any. The four stars on the upper left of the plot are giants.
We can visualize the problems here, but for higher dimensional data this is much more difficult. An influential point is one whose removal from the dataset would cause a large change in the fit; it may or may not be an outlier and may or may not have large leverage, but it will tend to have at least one of those two properties.
This is illustrated in Figure 7. The combination of a large residual and large leverage leads to influence; the Cook statistic Di combines the two. An index plot of Di can be used to identify influential points.
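A sketch of such an index plot for a fitted model g, using the countries labels from before:

    cook <- cooks.distance(g)
    plot(cook, ylab = "Cook's distance")         # an index plot of D_i
    identify(1:length(cook), cook, countries)    # click to label the most influential countries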
I have identified the largest three values; the accompanying output shows the change in the pop75 coefficient when countries such as Libya and Ireland are omitted. Try this for the other estimates - which countries stick out?
Notice that the ddpi term is no longer significant and that the R² value has decreased a lot. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity, which indicates that some change in the model is necessary. (The reference figure shows example residual plots labelled "No problem", "Heteroscedasticity" and "Nonlinear".) Look for the same things, except that in the case of plots against predictors not in the model, look for any relationship which might indicate that this predictor should be included.
What do you see? The latter plot is designed to check for non-constant variance only. It folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance.
The first plot is still needed because non-linearity must be checked. The following four for loops (sketched below) show: 1. constant variance, 2. strong non-constant variance, 3. mild non-constant variance, and 4. nonlinearity. Repeat the generation to get an idea of the usual amount of variation; repeated generation of plots under a model where there is no violation of the assumption that the diagnostic plot is designed to check is helpful in making this judgement.
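A sketch in the spirit of those loops (the particular functions used to generate each violation are my own choices):

    par(mfrow = c(3, 3))
    for (i in 1:9) plot(1:50, rnorm(50))                              # 1. constant variance
    for (i in 1:9) { x <- runif(50); plot(x, x * rnorm(50)) }         # 2. strong non-constant variance
    for (i in 1:9) { x <- runif(50); plot(x, sqrt(x) * rnorm(50)) }   # 3. mild non-constant variance
    for (i in 1:9) { x <- runif(50); plot(x, cos(2 * pi * x) + rnorm(50, sd = 0.2)) }  # 4. nonlinearity
    par(mfrow = c(1, 1))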
Can you see the two groups? Given two independent samples from normal distributions, we can test for equal variance using the ratio of the two sample variances as the test statistic. The null distribution is an F, with degrees of freedom determined by the two sample sizes.
Alternatively, one can transform y to h(y), where h is chosen so that var h(y) is constant. Sometimes yi ≤ 0 for some i, in which case the transformations may choke. Our guess at a variance-stabilizing transformation worked out here, but had it not, we could always have tried something else. The square root transformation is often appropriate for count response data.
The Poisson distribution is a good model for counts, and that distribution has the property that the mean is equal to the variance, thus suggesting the square root transformation.
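A quick numerical check of this argument (not from the text): the variance of the square root of a Poisson count is roughly 1/4 whatever the mean.

    sapply(c(5, 20, 100), function(m) var(sqrt(rpois(100000, m))))   # all close to 0.25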
It might be even better to go with a Poisson regression rather than the normal-based regression we are using here. A formal test may be good at detecting a particular kind of non-constant variance but have no power to detect another. Residual plots are more versatile because unanticipated problems may be spotted.
Of course, sometimes the interpretation of the plot may be ambiguous, but at least one can be sure that nothing is seriously wrong with the assumptions. For this reason, I usually prefer a graphical approach to diagnostics. We can look at plots of y against each xi. Partial regression or added variable plots can help isolate the effect of xi on y; partial residual plots are a competitor to added variable plots. The partial regression plot also provides some intuition about the meaning of regression coefficients.
We are looking at the marginal relationship between the response and the predictor after the effect of the other predictors has been removed. Multiple regression is difficult because we cannot visualize the full relation- ship because of the high dimensionality. The partial regression plot allows us to focus on the relationship between one predictor and the response, much as in simple regression. The graphical analysis has shown a relationship in the data that a purely numerical analysis might easily have missed.
Higher dimensional plots can also be useful for detecting structure that cannot be seen in two dimensions. These are interactive in nature, so you need to try them to see how they work. Two ideas are: 1. Spinning - 3D plots where color, point size and rotation are used to give the illusion of a third dimension. 2. Brushing - two or more plots are linked so that points which are brushed in one plot are highlighted in another. Certainly there are communication difficulties, as these plots cannot be easily printed.
Many statistical packages allow for this kind of investigation. See www. The residuals can be assessed for normality using a Q-Q plot.
The steps are: 1. sort the residuals; 2. compute the corresponding normal quantiles ui = Φ⁻¹(i/(n+1)); 3. plot the sorted residuals against the ui. Because these residuals have been normalized, they should lie along a 45 degree line. I generate data from different distributions: 1. Normal; 2. Lognormal - an example of a skewed distribution; 3. Cauchy - an example of a long-tailed (leptokurtic) distribution; 4. Uniform - an example of a short-tailed (platykurtic) distribution. This is a useful trick when you want to experiment with changing these settings.
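A sketch of how such plots can be produced, assuming g is the fitted model (the simulated comparisons use my own sample size):

    qqnorm(residuals(g), ylab = "Residuals")
    qqline(residuals(g))
    # simulated residuals from other distributions, for comparison
    par(mfrow = c(2, 2))
    for (r in list(rnorm(50), exp(rnorm(50)), rcauchy(50), runif(50))) {
      qqnorm(r); qqline(r)
    }
    par(mfrow = c(1, 1))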
The consequences of non-normality are that the least squares estimates remain unbiased but are no longer optimal, and that tests and confidence intervals which rely on normality may be invalid. However, it has been shown that only really long-tailed distributions cause a problem. Mild non-normality can safely be ignored, and the larger the sample size the less troublesome the non-normality. What to do? A transformation of the response may solve the problem - this is often true for skewed errors.
Other changes in the model may help. Accept non-normality and base the inference on the assumption of another distribution, or use resampling methods such as the bootstrap or permutation tests. Alternatively, use robust methods which give less weight to outlying points. This is appropriate for long-tailed distributions. For short-tailed distributions, the consequences of non-normality are not serious and can reasonably be ignored.
There are formal tests for normality such as the Kolmogorov-Smirnov test, but these are not as flexible as the Q-Q plot. The p-value is not very helpful as an indicator of what action to take, and for smaller sample sizes formal tests lack power. A related tool, useful for positive quantities such as the leverages, is the half-normal plot: the idea is to plot the data against the positive normal quantiles. The steps are: 1. sort the data; 2. compute the quantiles ui = Φ⁻¹((n + i)/(2n + 1)); 3. plot the sorted x(i) against the ui.
We are looking for outliers which will be apparent as points that diverge substantially from the rest of the data. The halfnorm function comes from the book library.
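A sketch of the construction done by hand, using the leverages lev computed earlier (the halfnorm function in the book's library does essentially this in one call):

    n  <- length(lev)
    ui <- qnorm((n + 1:n) / (2 * n + 1))     # positive normal quantiles
    plot(ui, sort(lev), xlab = "Half-normal quantiles", ylab = "Sorted leverages")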
Libya again stands out in this plot. For data collected over time, it is wise to check the assumption of uncorrelated errors. Use formal tests like the Durbin-Watson test or the runs test. If you do have correlated errors, you can use GLS. Such problems are common in econometrics.
For the example, we use some data taken from an environmental study that measured the four variables ozone, solar radiation, temperature, and wind speed for consecutive days in New York. We notice that there are some missing values. The default behavior in R when performing a regression with missing values is to exclude any case that contains a missing value.
The missing values in the data were not used in the construction of the model but this also breaks up the sequential pattern in the data. I get round this by reintroducing missing values into the residuals corresponding to the omitted cases. We see that there is no significant correlation. You can plot more than just successive pairs if you suspect a more complex dependence.
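An informal check can be sketched as follows for a fitted model g; this simple version ignores the gaps left by the omitted missing-value cases that the text reintroduces. The Durbin-Watson test mentioned earlier is available in add-on packages (e.g. dwtest in the lmtest package), though the text does not necessarily use that implementation.

    res <- residuals(g)
    n   <- length(res)
    plot(res[-n], res[-1], xlab = "residual at day i", ylab = "residual at day i+1")
    cor(res[-n], res[-1])          # near zero suggests no serial correlation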
For spatial data, more complex checks are required. We may also consider adding additional predictors that are functions of the existing predictors like quadratic or crossproduct terms. So the use of standard regression methods for the logged response model requires that we believe that the errors enter multiplicatively in the original scale.
As a practical matter, we often do not know how the errors enter the model: additively, multiplicatively or otherwise. The usual approach is to try different transforms and then check the residuals to see whether they satisfy the conditions required for linear regression. Unless you have good information that the error enters in some particular way, this is the simplest and most appropriate way to go. Although you may transform the response, you will probably need to express predictions in the original scale.
This is simply a matter of back-transforming. This interval will not be symmetric, but this may be desirable — remember what happened with the prediction confidence intervals for the Galapagos data. Regression coefficients must be interpreted on the transformed scale; there is no straightforward way of back-transforming them to values that can be interpreted in the original scale. You cannot directly compare regression coefficients for models where the response transformation is different.
Difficulties of this type may dissuade one from transforming the response even if this requires the use of another type of model, such as a generalized linear model. The Box-Cox method is a popular way to determine a transformation on the response. It is designed for strictly positive responses and chooses the transformation to find the best fit to the data. For one dataset, we can see that there is no good reason to transform; for another, we see that perhaps a cube-root transformation might be best.
A square root is also a possibility, as this falls just within the confidence intervals. Certainly there is a strong need to transform. What if some yi ≤ 0? Sometimes adding a constant to all y can work, provided that constant is small.
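One way to compute the Box-Cox profile is the boxcox function in the MASS package; the call below assumes the savings data frame used earlier and is a sketch rather than the text's own code:

    library(MASS)                                   # provides boxcox()
    g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
    boxcox(g, plotit = TRUE)                        # profile log-likelihood over lambda
    boxcox(g, lambda = seq(0.5, 1.5, by = 0.1))     # zoom in on the region of interest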
Turning to the predictors, one can try transforming each in turn; however, this takes time, and furthermore the correct transformation for each predictor may depend on getting the others right too. Partial residuals are a good way of finding suggestions for transforming the predictors. For example, in the analysis of the savings data, we observed that there were two groups in the data, and we might want to fit a different model to the two parts.
Suppose we focus attention on just the pop15 predictor for ease of presentation. The two fits are shown in Figure 8. If we believe the fit should be continuous as the predictor varies, then this is unsatisfactory. One solution to this problem is the broken stick regression fit. Define two basis functions, Bl(x) = c - x for x < c and 0 otherwise, and Br(x) = x - c for x > c and 0 otherwise. Bl and Br form a first-order spline basis with a knotpoint at c. Sometimes Bl and Br are called hockey-stick functions because of their shape. The two linear parts are guaranteed to meet at c. Notice that this model uses only three parameters, in contrast to the four total parameters used in the subsetted regression illustrated above.
A parameter has been saved by insisting on the continuity of the fit at c. The intercept of this model is the value of the response at the join. We might question which fit is preferable in this particular instance. For the high pop15 countries, we see that the imposition of continuity causes a change in sign for the slope of the fit. We might argue that the two groups of countries are so different, and there are so few countries in the middle region, that we might not want to impose continuity at all.
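A hedged sketch of the broken-stick fit for the savings data; the knot at 35 is illustrative:

    lhs <- function(x) ifelse(x < 35, 35 - x, 0)    # B_l, the left hockey stick
    rhs <- function(x) ifelse(x < 35, 0, x - 35)    # B_r, the right hockey stick
    gb  <- lm(sr ~ lhs(pop15) + rhs(pop15), data = savings)
    summary(gb)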
We can have more than one knotpoint simply by defining more pairs of basis functions with different knotpoints. Broken stick regression is sometimes called segmented regression. Allowing the knotpoints to be parameters is worth considering, but this will result in a nonlinear model. For polynomial regression, there are two ways to choose the degree d: 1. keep adding terms until the added term is not statistically significant; 2. start with a large d and eliminate terms that are not statistically significant, starting with the highest order term.
Warning: Do not eliminate lower order terms from the model even if they are not statistically significant. An additive change in scale would change the t-statistic of all but the highest order term. We would not want the conclusions of our study to be so brittle to such changes in the scale which ought to be inconsequential.
What do you notice about the other p-values? Why do we find a quadratic model when the previous analysis on transforming predictors found that the ddpi variable did not need transformation? Check that starting from a large model including the fourth power and working downwards gives the same result.
Since there is often no necessary importance to zero on a scale of measurement, there is no good reason to remove the linear term in this model but not in the previous version. No advantage would be gained. You have to refit the model each time a term is removed, and for large d there can be problems with numerical stability. One can instead define new predictors z that are particular linear combinations of the powers of x, chosen to be orthogonal to each other; the z are called orthogonal polynomials. The value of orthogonal polynomials has declined with advances in computing speeds, although they are still worth knowing about because of their numerical stability and ease of use.
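For example, a quartic in ddpi for the savings data can be fitted with orthogonal polynomials along these lines (a sketch, not necessarily the text's exact call):

    g4 <- lm(sr ~ poly(ddpi, 4), data = savings)    # quartic via orthogonal polynomials
    summary(g4)                                     # coefficients of different orders are uncorrelated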
The poly function constructs orthogonal polynomials. Polynomial fits are global, in the sense that each data point affects the fit across the whole range, because the power functions used for the polynomials take non-zero values across the whole range of the predictor. In contrast, the broken stick regression method localizes the influence of each data point to its particular segment, which is good, but we do not have the same smoothness as with the polynomials.
There is a way we can combine the beneficial aspects of both these methods — smoothness and local influence — by using B-spline basis functions.
1. A given basis function is non-zero on an interval defined by four successive knots and zero elsewhere. This property ensures the local influence property.
2. The basis function is a cubic polynomial on each sub-interval between successive knots.
3. The basis function is continuous, and continuous in its first and second derivatives, at each knot point. This property ensures the smoothness of the fit.
4. The basis function integrates to one over its support.
The basis functions at the ends of the interval are defined a little differently to ensure continuity in derivatives at the edge of the interval. We generate the data and display it in Figure 8. We see that an order 4 polynomial is a clear underfit; order 12 is much better, although the fit is too wiggly in the first section and misses the point of inflection.
We now create the B-spline basis. You need to have three additional knots at the start and end to get the right basis. I have chosen the knot locations to put more in regions of greater curvature. I have used 12 basis functions for comparability to the orthogonal polynomial fit.
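A sketch of this construction using the splines package; the knot vector (which repeats the boundary knots) is illustrative, and x is assumed to be the sorted simulated predictor on [0, 1]:

    library(splines)
    # 16 knots minus the default order 4 gives 12 basis functions
    knots <- c(0, 0, 0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 1, 1, 1, 1)
    bx <- splineDesign(knots, x)      # B-spline basis evaluated at the predictor values
    gs <- lm(y ~ bx)
    plot(x, y)
    lines(x, fitted(gs))              # the spline fit (x assumed sorted)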
We see that the fit comes very close to the truth. Regression splines are useful for fitting functions with some flexibility provided we have enough data. We can form basis functions for all the predictors in our model but we need to be careful not to use up too many degrees of freedom.
Alternatively, you could implement this using the regression spline bases for each predictor variable. It is important to realize the strengths and weaknesses of regression analysis. For larger data sets with relatively little noise, more recently developed complex models will be able to fit the data better while keeping the number of parameters under control. One relative advantage of regression is that the models are easier to interpret, in contrast to techniques like neural networks which are usually only good for predictive purposes.
We might want to rescale the predictors and/or the response because:
1. Predictors of similar magnitude are easier to compare.
2. A change of units might aid interpretability.
3. Numerical stability is enhanced when all the predictors are on a similar scale.
The regression plane always runs through the point of the averages, which because of the centering is now at the origin. Scaling each variable by its standard deviation as well has the advantage of putting all the predictors and the response on a comparable scale, which makes comparisons simpler. It also allows the coefficients to be viewed as a kind of partial correlation — the values will always be between -1 and 1.
It also avoids some numerical problems that can arise when variables are of very different scales. The downside of this scaling is that the regression coefficients now represent the effect of a one standard unit increase in the predictor on the response in standard units — this might not always be easy to interpret. One purpose for principal components is to transform the X to orthogonality. For example, consider the case with two predictors depicted in Figure 9.
The original predictors, x1 and x2, are clearly correlated and so the X-matrix will not be orthogonal. This will complicate the interpretation of the effects of x1 and x2 on the response. Suppose we rotate the coordinate axes so that in the new system the predictors are orthogonal. These rotated directions, z1 and z2 in our two-predictor example, are simply linear combinations of the original predictors. This is the geometrical description of principal components. We now indicate how these directions may be calculated.
The directions are given by the eigenvectors of X^T X: writing X^T X = U D U^T, with the eigenvalues on the diagonal of D, we set Z = XU. Zero eigenvalues indicate non-identifiability. The columns of Z are called the principal components, and these are orthogonal to each other. Another way of looking at it is to try to find the linear combinations of X which have the maximum variation. Ideally, only a few eigenvalues will be large, so that almost all the variation in X will be representable by those first few principal components.
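A sketch of the computation for a fitted model g (names are mine; prcomp gives an equivalent result):

    x <- scale(model.matrix(g)[ , -1])     # drop the intercept; centre and scale the predictors
    e <- eigen(t(x) %*% x)
    e$values                               # small eigenvalues indicate near-collinearity
    z <- x %*% e$vectors                   # the principal components Z = XU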
Principal components can be effective in some situations, but:
1. The principal components are linear combinations of the predictors — little is gained if these are not interpretable. Generally the predictors have to be measurements of comparable quantities for interpretation to be possible.
2. Principal components regression does not use the response.
There are variations which involve not using the intercept in X^T X, or using the correlation matrix of the predictors instead of X^T X.
Now examine the X-matrix: what are the scales of the variables? It might make more sense to standardize the predictors before trying principal components.
In Figure 9, the elbow is at 3, telling us that we need only consider 2 principal components. One advantage of principal components is that the method transforms the predictors to an orthogonal basis. Because the directions of the eigenvectors are set successively in the greatest remaining direction of variation in the predictors, it is natural that they be ordered in significance in predicting the response. However, there is no guarantee of this — we see here that the fifth eigenvector is significant while the fourth is not, even though there is about six times greater variation in the fourth direction than the fifth.
In this example it hardly matters, since most of the variation is explained by the earlier values, but look out for this effect in the first few eigenvalues of other datasets. Look back at the first eigenvector - this is roughly a linear combination of all the standardized variables. Now plot each of the variables as they change over time — see Figure 9.
This suggests we identify the first principal component with a time trend effect. The second principal component is roughly a contrast between the numbers unemployed and in the armed forces. (Figure 9 plots each of the Longley variables, such as Unemployed, Armed Forces and Population, against Year.)
We could do more work on the other principal components. This illustrates a typical use of principal components for regression: some intuition is required to form new variables as combinations of older ones.
This is the typical multiple regression setting. See Figure 9. The algorithm proceeds as follows:
1. Regress Y on each Xi in turn to get the coefficients b1i, and form the first component T1 as a weighted combination of the Xi.
2. Regress Y on T1, and regress each Xi on T1. The residuals from these regressions have the effect of T1 removed.
3. Replace Y and each Xi by the residuals of the corresponding regressions, and go back to step one, updating the index.
There are various choices for the weighting scheme; in the most common choice, the variances of the bij are inversely proportional to var(Xj), which does make some sense. The algorithm ensures that the components Ti are uncorrelated, just as the principal components are uncorrelated. For this example, we again use the Longley data.
We will not need intercept terms in any of the regressions because of the centering. Now, for later comparison, we have the regression on just the first PC. Now we compute the second component of the PLS: we need to regress out the effect of the first component and then use the same computational method as above. Now compare this fit to the two-component PCR.
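A rough sketch of one pass of the algorithm described above, using equal weights for the simple-regression slopes (one possible weighting choice); X and y are assumed to be the centred predictor matrix and centred response, and the names are mine:

    b1 <- apply(X, 2, function(xj) coef(lm(y ~ xj - 1)))  # slope of y on each predictor
    T1 <- X %*% b1 / ncol(X)                              # first component: weighted combination
    g1 <- lm(y ~ T1 - 1)                                  # regression on the first component
    yres <- residuals(g1)                                 # y with the effect of T1 removed
    Xres <- apply(X, 2, function(xj) residuals(lm(xj ~ T1 - 1)))
    # repeating the same steps on yres and Xres gives the second component T2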
Notes: The tricky part is choosing how many components are required. Crossvalidation is a possible way of selecting the number of components. There are other faster versions of the algorithm described above but these generally provide less insight into the method.
PLS has been criticized as an algorithm that solves no well-defined modeling problem. PLS has the biggest advantage over ordinary least squares and PCR when there are large numbers of variables relative to the number of cases.

PCR and PLS compared

PCR attempts to find linear combinations of the predictors that explain most of the variation in these predictors using just a few components. The purpose is dimension reduction.
Because the principal components can be linear combinations of all the predictors, the number of variables used is not always reduced. Because the principal components are selected using only the X-matrix and not the response, there is no definite guarantee that PCR will predict the response particularly well, although this often happens. If it happens that we can interpret the principal components in a meaningful way, we may achieve a much simpler explanation of the response.
Thus PCR is geared more towards explanation than prediction. In contrast, PLS finds linear combinations of the predictors that best explain the response. It is most effective when there are large numbers of variables to be considered. If successful, the variability of prediction is substantially reduced.
On the other hand, PLS is virtually useless for explanation purposes. If X^T X is close to singular, we have approximate collinearity or multicollinearity (some just call it collinearity). Collinearity can be detected in several ways:
1. Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.
2. A regression of xi on all the other predictors gives Ri². Repeat for all predictors. An Ri² close to one indicates a problem — the offending linear combination may be found.
3. Examine the eigenvalues of X^T X - small eigenvalues indicate a problem.
Collinearity makes some of the parameters hard to estimate. As an aside, the variance of the first principal component is maximized, and so the variance of the corresponding regression coefficient will tend to be small. Another consequence of this equation is that it tells us what designs will minimize the variance of the regression coefficients if we have the ability to place the X.
Also, we can maximize Sxjxj = Σi (xij - x̄j)² by spreading the values of xj as much as possible. The maximum is attained by placing half the points at the minimum practical value and half at the maximum. Unfortunately, this design assumes the linearity of the effect and would make it impossible to check for any curvature.
So, in practice, most would put some design points in the middle of the range to allow checking of the fit. If Rj² is close to one, then the variance inflation factor 1/(1 - Rj²) will be large.
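A sketch of computing the variance inflation factors directly from this definition for a fitted model g (add-on packages such as faraway or car provide a ready-made vif function):

    x <- model.matrix(g)[ , -1]            # predictor matrix without the intercept
    vif <- numeric(ncol(x))
    for (j in 1:ncol(x)) {
      r2     <- summary(lm(x[, j] ~ x[, -j]))$r.squared
      vif[j] <- 1 / (1 - r2)
    }
    round(vif, 1)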
Collinearity leads to imprecise estimates of the coefficients and to t-tests that may fail to reveal significant factors. In the example, three of the predictors have large p-values, but all are variables that might be expected to affect the response. This means that problems are being caused by more than just one linear combination. Now check out the variance inflation factors. How can we get rid of this problem? One way is to throw out some of the variables. Examine the full correlation matrix above. Imagine a table — as two diagonally opposite legs are moved closer together, the table becomes increasingly unstable.
The effect of collinearity on prediction depends on where the prediction is to be made. The greater the distance from the observed data, the more unstable the prediction.
Distance needs to be considered in a Mahalanobis sense rather than Euclidean. One cure for collinearity is amputation — too many variables are trying to do the same job of explaining the response. Another approach is ridge regression. Suppose that the predictors have been centered by their means and scaled by their standard deviations and that the response has been centered.
This is illustrated in Figure 9. The least squares estimate is at the center of the ellipse, while the ridge regression estimate is the point on the ellipse closest to the origin. The ellipse is a contour of equal density of the posterior probability, which in this case will be comparable to a confidence ellipse. Use of Lagrange multipliers leads to ridge regression. We demonstrate the method on the Longley data; a sketch of the computation is given below.
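A sketch using lm.ridge from the MASS package and R's built-in longley data frame (the book's copy of the data and the lambda grid may differ):

    library(MASS)                                    # provides lm.ridge()
    gr <- lm.ridge(Employed ~ ., data = longley, lambda = seq(0, 0.1, by = 0.001))
    plot(gr)                                         # the ridge trace: one curve per coefficient
    select(gr)                                       # HKB, LW and GCV suggestions for lambda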
The ridge trace plot is shown in Figure 9. The topmost curve represents the coefficient for Year, and the dashed line that starts well below zero but ends above it is for GNP. Consider the change in the coefficient for GNP: for the least squares fit, the effect of GNP on the response (the number of people employed) is negative.
Bias is undesirable, but there are other considerations. Sometimes a large reduction in the variance may be obtained at the price of an increase in the bias. If the MSE is reduced as a consequence, then we may be willing to accept some bias. This is the trade-off that ridge regression makes - a reduction in variance at the price of an increase in bias.
This is a common dilemma. Turning to variable selection: why bother? We want to explain the data in the simplest way — redundant predictors should be removed. Applied to regression analysis, this principle implies that the smallest model that fits the data is best.
Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. Degrees of freedom will be wasted.
Collinearity is caused by having too many variables trying to do the same job. Prior to variable selection: 1. Identify outliers and influential points - maybe exclude them at least temporarily. 2. Add in any transformations of the variables that seem appropriate.
For example, in polynomial models, x² is a higher order term than x. When selecting variables, it is important to respect the hierarchy: lower order terms should not be removed from the model before higher order terms in the same variable. There are two common situations where this arises. The first is polynomial models.
Scale changes should not make any important change to the model but in this case an additional term has been added. This is not good. This illustrates why we should not remove lower order terms in the presence of higher order terms. We would not want interpretation to depend on the choice of scale.
Removing the lower order term while keeping the higher order one amounts to a hypothesis about what happens at zero on the chosen scale of measurement; often this hypothesis is not meaningful and should not be considered. Only when this hypothesis makes sense in the context of the particular problem could we justify the removal of the lower order term.
The second situation is models with interactions. A joint removal of the higher order terms would correspond to the clearly meaningful comparison of a quadratic surface and a linear one, whereas just removing the x1 x2 term would correspond to a surface that is aligned with the coordinate axes.