**Statistics****`****LinearRegression****`**

The built-in function Fit finds the least-squares fit to a data set as a linear combination of given basis functions. This is precisely what one needs for linear regression. The functions Regress and DesignedRegress provided in this package augment Fit by giving a list of commonly used statistics such as RSquared, EstimatedVariance, and ANOVATable. The output of the regression functions can be controlled so that only needed information is produced.

Linear regression finds the linear combination of basis functions that best approximates the data in terms of least squares. The basis functions, , define the model by specifying the predictors as functions of the independent variables. The resulting model is , where is the

response, is the

basis function evaluated at the

case, and is the error of the

case.

Estimates of the coefficients are calculated to minimize , the residual sum of squares. For example, simple linear regression is accomplished by defining the basis functions as and , in which case and are found to minimize

.

Using Regress.

The arguments of Regress are of the same form as those of Fit. The data can be a list of vectors, each vector consisting of the observed values of the independent variables and the associated response. The basis functions must be functions of the symbols given as variables. These symbols correspond to the independent variables represented in the data.

The data can also be a vector of data points. In this case, Regress assumes that this vector represents the values of a response variable with the independent variable having values ,

, ... .

Ways of specifying data in Regress.

This loads the package.
In[1]:= **<<Statistics`LinearRegression`**

Here is a set of data. The first element in each pair gives the value of the independent variable, while the second gives the observed response.
In[2]:= **data = {{0.055, 90}, {0.091, 97}, {0.138, 107},**

{0.167, 124}, {0.182, 142}, {0.211, 150},

{0.232, 172}, {0.248, 189}, {0.284, 209},

{0.351, 253}};

This is a plot of the data.
In[3]:= **dplot = ListPlot[data]**

This is the regression output for fitting the model . Chop replaces the p-values below with 0

.
In[4]:= **(regress = Regress[data, {1, x^2}, x];**

Chop[regress, 10^(-6)])

Out[4]=

You can use Fit if you only want the fit function.
In[5]:= **func = Fit[data, {1, x^2}, x]**

Out[5]=

Options for Regress.

Two of the options of Regress influence the method of calculation. IncludeConstant has a default setting True, which causes a constant term to be added to the model even if it is not specified in the basis functions. To fit a model without this constant term, specify IncludeConstant->False and do not include a constant in the basis functions.

The Weights option allows you to implement weighted least squares by defining a vector of weights for the error variances. The errors are assumed to be uncorrelated, with unequal variances given by , where is the weight associated with the

response. When Weights

->

, ... ,

, the parameters are chosen to minimize the weighted sum of squared residuals .

The options RegressionReport and BasisNames affect the form and content of the output. If you do not specify any options, Regress automatically produces a list including values for ParameterTable, RSquared, AdjustedRSquared, EstimatedVariance, and ANOVATable. This set of statistics comprises the default SummaryReport. The option RegressionReport can be used to specify a single statistic or a list of statistics so that more (or less) than the default set of statistics is included in the output. RegressionReportValues[Regress] gives the values that may be included in the RegressionReport list for the Regress function.

With the option BasisNames, you can label the headings of predictor variables in tables such as ParameterTable and ParameterCITable.

The regression functions will also accept any option that can be specified for SingularValues or StudentTCI. In particular, the numerical tolerance for the internal singular value decomposition is specified using Tolerance, and the confidence level for hypothesis testing and confidence intervals is specified using ConfidenceLevel

.

Possible settings for RegressionReport or settings that may be included in a list specified by RegressionReport.

ANOVATable, a table for analysis of variance, provides a comparison of the given model to a smaller one including only a constant term. If IncludeConstant->False is specified, then the smaller model is reduced to the data. The table includes the degrees of freedom, the sum of squares and the mean squares due to the model (in the row labeled Model) and due to the residuals (in the row labeled Error). The residual mean square is also available in EstimatedVariance, and is calculated by dividing the residual sum of squares by its degrees of freedom. The -test compares the two models by the ratio of their mean squares. If the value of is large, the null hypothesis supporting the smaller model is rejected.

To evaluate the importance of each basis function, you can get information about the parameter estimates from the parameter table obtained by setting RegressionReport to ParameterTable, or by including ParameterTable in the list specified by RegressionReport. This table includes the estimates, their standard errors, and -statistics for testing whether each parameter is zero. The -values are calculated by comparing the obtained statistic to the distribution with degrees of freedom, where is the sample size and is the number of predictors. Confidence intervals for the parameter estimates, also based on the distribution, can be found by specifying ParameterCITable. ParameterConfidenceRegion specifies the ellipsoidal joint confidence region of all fit parameters. ParameterConfidenceRegion[

,

, ...

] specifies the joint conditional confidence region of the fit parameters associated with basis functions

,

, ... , a subset of the complete set of basis functions.

The square of the multiple correlation coefficient is called the coefficient of determination , and is given by the ratio of the model sum of squares to the total sum of squares. It is a summary statistic that describes the relationship between the predictors and the response variable. AdjustedRSquared is defined as , and gives an adjusted value that you can use to compare subsequent subsets of models. The coefficient of variation is given by the ratio of the residual root mean square to the mean of the response variable. If the response is strictly positive, this is sometimes used to measure the relative magnitude of error variation.

Each row in MeanPredictionCITable gives the confidence interval for the mean response for the corresponding predictor variable(s). Each row in SinglePredictionCITable gives the confidence interval for the response of a single observation of the corresponding predictor variable(s). MeanPredictionCITable gives a region likely to contain the regression curve, while SinglePredictionCITable

gives a region likely to contain all possible observations.

In this example, only the residuals, the confidence interval table for the predicted response of single observations, and the parameter joint confidence region are produced.
In[6]:= **regress = Regress[data, {1, x^2}, x,**

RegressionReport ->

{FitResiduals, SinglePredictionCITable,

ParameterConfidenceRegion}]

Out[6]=

This is a list of the residuals extracted from the output.
In[7]:= **errors = FitResiduals /. regress**

Out[7]=

The observed response, the predicted response, the standard errors of the predicted response, and the confidence intervals may also be extracted.
In[8]:= **{observed, predicted, se, ci} =**

Transpose[(SinglePredictionCITable /. regress)[[1]]];

You can now plot the predicted values against the residuals for diagnostic purposes.
In[9]:= **ListPlot[Transpose[{predicted, errors}]]**

Here the predicted values and lower and upper confidence limits are paired with the corresponding

values.
In[10]:= **(xval = Map[First, data];**

predicted = Transpose[{xval, predicted}];

lowerCI = Transpose[{xval, Map[First, ci]}];

upperCI = Transpose[{xval, Map[Last, ci]}]);

This loads the function MultipleListPlot.
In[11]:= **<<Graphics`MultipleListPlot`**

This displays the raw data, fitted curve, and the

confidence intervals for the predicted values of single observations.
In[12]:= **MultipleListPlot[**

data, predicted, lowerCI, upperCI,

SymbolShape -> {PlotSymbol[Diamond], None, None,

None},

PlotJoined -> {False, True, True, True},

PlotStyle -> {Automatic, Automatic,

Dashing[{.05, .05}], Dashing[{.05, .05}]}

]

The functions Show and Graphics may be used to display an Ellipsoid object. This is the joint

confidence region of the regression parameters.
In[13]:= **Show[Graphics[ParameterConfidenceRegion /. regress],**

Axes -> True, AxesLabel -> {"Constant","x Squared"}]

This package provides numerous diagnostics for evaluating the data and the fit. The HatDiagonal gives the leverage of each point, measuring whether each observation of the predictor variables is unusual. CookD and PredictedResponseDelta are influence diagnostics, simultaneously measuring whether the predictor variables and the response variable are unusual. Unfortunately, these diagnostics are primarily useful in detecting single outliers. In particular, the diagnostics may indicate a single outlier, but deleting that observation and recomputing the diagnostics may indicate others. All of these diagnostics are subject to this "masking" effect. They are described in greater detail in Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, by D. A. Belsley, E. Kuh, and R. E. Welsch (John Wiley & Sons, 1980), and "Detection of influential observations in linear regression", by R. D. Cook, Technometrics, 19, 1977.

Diagnostics for detecting outliers.

Some diagnostics indicate the degree to which individual basis functions contribute to the fit, or whether the basis functions are involved in a collinear relationship. The sum of the elements in the SequentialSumOfSquares vector gives the model sum of squares listed in the ANOVATable. Each element corresponds to the increment in the model sum of squares obtained by sequentially adding each (nonconstant) basis function to the model. Each element in the PartialSumOfSquares vector gives the increase in the model sum of squares due to adding the corresponding (nonconstant) basis function to a model consisting of all other basis functions. SequentialSumOfSquares is useful in determining the degree of a univariate polynomial model, while PartialSumOfSquares is useful in trimming a large set of predictor variables. VarianceInflation or EigenstructureTable may also be used for predictor set trimming.

Diagnostics for evaluating basis functions and detecting collinearity.

The Durbin-Watson

statistic is used for testing the existence of a first-order autoregressive process. A value close to 2 indicates uncorrelated errors, an underlying assumption of the regression model.

Correlated errors diagnostic.

Other statistics not mentioned here can be computed with the help of the "catcher" matrix. This matrix catches all the information the predictor variables have about the parameter vector. This matrix can be exported from Regress by specifying CatcherMatrix with the RegressionReport option.

Matrix describing the parameter information provided by the predictors.

Frequently, linear regression is applied to an existing design matrix rather than the original data. A design matrix is a list containing the basis functions evaluated at the observed values of the independent variable. If your data are already in the form of a design matrix with a corresponding vector of response data, you can use DesignedRegress for the same analyses as provided by Regress. DesignMatrix puts your data in the form of a design matrix.

Functions for linear regression using a design matrix.

DesignMatrix takes the same arguments as Regress. It can be used to get the necessary arguments for DesignedRegress, or to check whether you correctly specified your basis functions. When you use DesignMatrix, the constant term is always included in the model unless IncludeConstant->False is specified. Every option of Regress except IncludeConstant is accepted by DesignedRegress. RegressionReportValues[DesignedRegress] gives the values that may be included in the RegressionReport list for the DesignedRegress function.

This is the design matrix used in the previous regression analysis.
In[14]:= **mat = DesignMatrix[data, {1, x^2}, x]**

Out[14]=

Here is the vector of observed responses.
In[15]:= **response = Map[Last, data]**

Out[15]=

The result of DesignedRegress is identical to that of Regress. Note that the predictor names that were specified for the output appear in the ParameterTable.
In[16]:= **DesignedRegress[mat, response, BasisNames ->**

{"Constant","x Squared"}] // Chop[#, 10^(-6)]&

Out[16]=

Linear regression using the singular value decomposition of the design matrix.

DesignedRegress will also accept the singular value decomposition of the design matrix. If the regression is not weighted, this approach will save recomputing the design matrix decomposition.

This is the singular value decomposition of the design matrix.
In[17]:= **svd = SingularValues[mat];**

When several responses are of interest, this will save recomputing the design matrix decomposition.
In[18]:= **DesignedRegress[svd, response,**

RegressionReport -> BestFitParameters]

Out[18]=